Access to large buckets (many objects in a hierarchy) is slow

jagane-opensource commented 4 years ago

Reading the s3 bucket using the following snippet all_objects = [obj for obj in bucket.objects.filter(Prefix=path)]

The bucket.objects.filter(Prefix=path)] takes a long time for buckets that have millions of objects. This is because we don't process it hierarchically using a separator. Switching to the list_objects_v2 api makes things much faster. However, this implies certain things - for example, we must necessarily browse directories by sending a path with a trailing "/" and get objects using a path that does not include a trailing "/". Additionally, it feels like the browser components are the correct place to store the information as to whether a path is a directory or a file. In fact, we send this information to jupyterlab when we provide it with a listing. The patch included here does the following:

Switch to lower level boto3 api - list_objects_v2 and get_object
Push the responsibility for adding a / to the end of directory lookups back to the typesecript code that runs in the browser pane
Turn off auto restore of last browse location because when jupyterlab calls us back it doesn't tell us whether the path is a directory or a file

jagane-opensource commented 4 years ago

This patch may be applied using: patch -p1 < patch-issue-13.txt patch-issue-13.txt

jagane-opensource commented 4 years ago

Here's a slightly improved version of the patch - more concise code, better named functions, etc. patch-issue-13-v2.txt

reevejd commented 4 years ago

Hi Jagane, thanks for the patch, this is great! The performance is indeed dramatically better with your changes. However, I've noticed that it doesn't properly fetch objects that are nested a few "folders" deep (i.e. objects with prefixes like path/to/my/object). I should have time to look into it further this Friday.

Would you like to open up a PR so you're properly recorded as the contributor for this improvement? If not, I can make the changes myself and list you as a contributor in the readme.

jagane-opensource commented 4 years ago

You are right, James - there was a bug in my patch that caused hierarchical listings to not work correctly. I fixed that now.

I actually created a PR on github - PR#13 and I have the latest version of the patch uploaded there. I also renamed a misnamed method and made the code more concise.

Jagane

On Wed, Apr 8, 2020 at 7:42 PM James Reeve notifications@github.com wrote:

Hi Jagane, thanks for the patch, this is great! The performance is indeed dramatically better with your changes. However, I've noticed that it doesn't properly fetch objects that are nested a few "folders" deep (i.e. objects with prefixes like path/to/my/object). I should have time to look into it further this Friday.

Would you like to open up a PR so you're properly recorded as the contributor for this improvement? If not, I can make the changes myself and list you as a contributor in the readme.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/IBM/jupyterlab-s3-browser/issues/13#issuecomment-611296442, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOUO6DDW2Y6CZR7BNE7XM5LRLUYZBANCNFSM4MDQZAWQ .

reevejd commented 4 years ago

Hi Jagane,

Thanks for making the fix, but I don't see your PR. Can you try creating it again?

jagane-opensource commented 4 years ago

Hello James

When you say PR, do you mean the 'Issue' feature in github, or do you folks have a different bug tracking system? I apologize if this is described somewhere else, I must have missed it.

Thanks Jagane

On Thu, Apr 9, 2020 at 5:44 AM James Reeve notifications@github.com wrote:

Hi Jagane,

Thanks for making the fix, but I don't see your PR. Can you try creating it again?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/IBM/jupyterlab-s3-browser/issues/13#issuecomment-611507451, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOUO6DAOYRDMAHHCJO6T3QDRLW7MNANCNFSM4MDQZAWQ .

reevejd commented 4 years ago

Hi Jagane,

In GitHub jargon a PR or Pull Request is a request to merge code from one branch into another. So on your fork of this repository, you can click Pull Request (or just go directly to this link: https://github.com/jagane-opensource/jupyterlab-s3-browser/pull/new/master)

On the next page you can see a diff of all your changes and you can click the Create pull request to make it visible to me.

jagane-opensource commented 4 years ago

Aha - I was thinking that PR means Problem Report, an archaic term for bug reports. I just created the github Pull Request, you should see it soon.

Best Jagane

On Thu, Apr 9, 2020 at 10:17 AM James Reeve notifications@github.com wrote:

Hi Jagane,

In GitHub jargon a PR or Pull Request is a request to merge code from one branch into another. So on your fork of this repository https://github.com/jagane-opensource/jupyterlab-s3-browser, you can click Pull Request (or just go directly to this link: https://github.com/jagane-opensource/jupyterlab-s3-browser/pull/new/master ) [image: open-a-pr] https://user-images.githubusercontent.com/12180828/78921993-c98c7d80-7a63-11ea-8984-f8de1e5409c3.png

On the next page you can see a diff of all your changes and you can click the Create pull request to make it visible to me.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/IBM/jupyterlab-s3-browser/issues/13#issuecomment-611646591, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOUO6DEP3KBRBN3VMRUA4A3RLX7KHANCNFSM4MDQZAWQ .

reevejd commented 4 years ago

Thanks for opening the PR, let's continue discussion in the PR itself: https://github.com/IBM/jupyterlab-s3-browser/pull/14#pullrequestreview-391023593

reevejd commented 4 years ago

Version 0.6.0 of this extension is now available with the fix for this issue from your PR. Thanks again for your contribution.

IBM / jupyterlab-s3-browser

Access to large buckets (many objects in a hierarchy) is slow #13