IBM / jupyterlab-s3-browser

A JupyterLab extension for browsing S3-compatible object storage
Apache License 2.0

Performance improvements when accessing buckets with many objects #14

Closed jagane-opensource closed 4 years ago

jagane-opensource commented 4 years ago

This set of changes does the following:

  1. Switches to lower-level boto3 APIs for accessing S3 buckets: list_objects_v2 and get_object (see the sketch below this list)
  2. Pushes the responsibility for appending a / to directory lookups back to the TypeScript code that runs in the browser pane
  3. Turns off auto-restore of the last browsed location, because when JupyterLab calls us back it doesn't tell us whether the path is a directory or a file
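
For readers unfamiliar with these calls, here is a minimal sketch of the pattern in item 1 (the helper functions and pagination details are illustrative, not the extension's actual code):

```python
import boto3

s3 = boto3.client("s3")  # assumes credentials are already configured

def list_prefix(bucket, prefix):
    """List one 'directory' level of a bucket with list_objects_v2."""
    paginator = s3.get_paginator("list_objects_v2")
    dirs, files = [], []
    # Delimiter="/" makes S3 group deeper keys into CommonPrefixes,
    # so each call returns only a single level of the hierarchy.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        dirs += [p["Prefix"] for p in page.get("CommonPrefixes", [])]
        files += [o["Key"] for o in page.get("Contents", [])]
    return dirs, files

def read_object(bucket, key):
    """Fetch a single object's contents with get_object."""
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```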
jagane-opensource commented 4 years ago

Hello James,

Can you give me access to the bucket you are seeing problems with, so that I can test it myself? Otherwise, if you can add more detail about the problems you are seeing, I will chase them down. Thanks

reevejd commented 4 years ago

I'm not sure how to securely grant you access in a way that would let you browse my bucket with this JupyterLab extension, but I hope this is enough to reproduce it. I've created a new bucket containing a single object with the key prefix1/prefix2/test.txt
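
For reference, one way to reproduce that layout with boto3 (a sketch; the bucket name is a placeholder, and this may not match how the bucket was actually created):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-test-bucket"  # placeholder name

# In regions other than us-east-1, create_bucket also needs
# CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=bucket)

# A single object whose key implies two levels of "directories";
# no explicit prefix1/ or prefix1/prefix2/ marker objects are created.
s3.put_object(Bucket=bucket, Key="prefix1/prefix2/test.txt", Body=b"hello")
```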

[Screenshot: 2020-04-09 at 2:49:04 PM]

When I try to access it via this extension I get this far:

[Screenshot: 2020-04-09 at 2:52:32 PM]

And then when I click on prefix1 I get:

[Screenshot: 2020-04-09 at 2:52:39 PM]
jagane-opensource commented 4 years ago

James, how are you creating these 'directories' or prefixes? I have found a peculiarity in the way Amazon S3 deals with fake directories. If I create these fake directories using the S3 web console and then list them, the S3 REST call list_objects_v2 returns a dummy entry that does not serve us well. So I now filter out these fake directory entries (see the sketch below), and that seems to work well.
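
To illustrate the kind of filtering described here (a sketch of the idea, not the extension's exact code): the console's "folder" placeholder shows up in list_objects_v2 results as a zero-byte object whose key ends with /, and it can simply be skipped.

```python
def filter_fake_directories(contents, prefix):
    """Drop zero-byte 'directory marker' entries from a list_objects_v2 Contents page."""
    real_files = []
    for obj in contents:
        key = obj["Key"]
        # The console's folder placeholder is a 0-byte object whose key is the
        # prefix itself (or any key ending in "/"); skip those entries.
        if key == prefix or (key.endswith("/") and obj.get("Size", 0) == 0):
            continue
        real_files.append(key)
    return real_files
```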

I have now tested using two different methods:

  1. Creating fake directories using the S3 web console and then copying a file into a directory several levels deep.
  2. Copying a file directly into a several-level-deep hierarchy, e.g. aws s3 cp onefile.txt s3://testbucket/testlevel1/testlevel2/testlevel3/onefile.txt

Both of the above methods work now. I am still not convinced that I have solved the problem you are seeing - it could be some other bug entirely. Can you test with the new change? Also, can you describe how you created the directories prefix1 and prefix2 in your example above?