ThouCheese / cloud-storage-rs

A crate for uploading files to Google cloud storage, and for generating download urls.
MIT License
124 stars 87 forks source link

[PERF] list object with partial response support #87

Closed cboudereau closed 2 years ago

cboudereau commented 2 years ago

Hello.

Fixes #88

I would like to contribute and discuss to support partial response object from gcs listing in order to speed up the listing on large bucket/prefix.

Here is the documentation explaining the performance optimization: https://cloud.google.com/storage/docs/json_api#partial-response

FYI I tried with a bucket which contains more than 60K objects. It took 1m40s with full object response rather than 40s with only the selfLink field in Object.

To test it, in the example, you can comment in/out the line which contains the new field option on the request parameter

Please, consider this PR as an experiment to open the discussion and see how we can integrate partial response support.

Sorry for the crappy version but I would like to bench first this implementation to replace our custom implementation.

Thanks in advance

ThouCheese commented 2 years ago

So how could we design this is a way that is ergonomic and safe to use? From what I understand the fields parameter controls which fields are populated in the JSON response sent by the GCS API. Then a 'general' response would need to have each of its fields be an option when the fields parameter is supported. This would essentially mean opting out of Rust's nice null-safety guarantees, i.e. the programmer has to make sure that when they access a field on cloud_storage struct, they also include that field in the fields parameter. Do you have any thoughts on what a convenient and safe API for this feature could look like?

cboudereau commented 2 years ago

Hello.

Thank you for your quick feedback. I tried in the first version to make everything optional in the Object and ListObject but we loose some mandatory information like bucket for instance.

Let me know if you have an idea in order integrate the feature. I will try to find out a solution for the coming week.

By the way do not hesitate to add commit or though in the PR 🙂

On Sat, Sep 25, 2021, 22:49 Luuk Wester @.***> wrote:

So how could we design this is a way that is ergonomic and safe to use? From what I understand the fields parameter controls which fields are populated in the JSON response sent by the GCS API. Then a 'general' response would need to have each of its fields be an option when the fields parameter is supported. This would essentially mean opting out of Rust's nice null-safety guarantees, i.e. the programmer has to make sure that when they access a field on cloud_storage struct, they also include that field in the fields parameter. Do you have any thoughts on what a convenient and safe API for this feature could look like?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ThouCheese/cloud-storage-rs/pull/87#issuecomment-927180961, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAP6WV6VZOAJXDICZSGDAADUDYYUPANCNFSM5EWMXQ6A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

cboudereau commented 2 years ago

Hello @ThouCheese. The PR is ready, I just left name, bucket and self_link mandatory and the performance is still improved as if there was only self_link. That way, we have the performance optimization without breaking all the API.

ThouCheese commented 2 years ago

Hi @cboudereau, I have left some comments behind on the changes, but I also don't see how you addressed the question I raised earlier about the safety of the API. It looks like you converted most fields to Option<T>, but this still means that type information gets erased. If we cannot come up with something better I suggest we create PartialObject, PartialListRequest and Object::list_partial, and then use the less type-safe version there. That way is it more opt-in.

cboudereau commented 2 years ago

Hello @ThouCheese.

Got it.

I pushed this morning a version where we are backward compatible regarding the API. Agree, having a dedicated list_partial is better now. I also fixed the line break.

ThouCheese commented 2 years ago

Hi! Getting close, but a couple of remarks:

  1. list_partial is still missing from the sync client,
  2. The docs for list_partial and PartialObject should tell the user why they are different from list and Object, and also ListRequest still says that the struct is used for Object::list,
  3. Do you think that list_request.fields could be a Vec of some enum, say ObjectField, so that the user cannot make typos when listing the object? That could be a nice extra safe api.
  4. There should be tests for these new functions.
cboudereau commented 2 years ago

Hi! Getting close, but a couple of remarks:

  1. list_partial is still missing from the sync client, done
  2. The docs for list_partial and PartialObject should tell the user why they are different from list and Object, and also ListRequest still says that the struct is used for Object::list, done
  3. Do you think that list_request.fields could be a Vec of some enum, say ObjectField, so that the user cannot make typos when listing the object? That could be a nice extra safe api. I don't think so since it is more like a query and not used anywhere else
  4. There should be tests for these new functions. done
cboudereau commented 2 years ago

Hello @ThouCheese . I replied in a previous comment to the different things I had to change.

I removed the field fields from the ListRequest so that it is impossible to mix partial/default response from the API to avoid any bad behavior.