danielfrg / s3contents

Jupyter Notebooks in S3 - Jupyter Contents Manager implementation
Apache License 2.0

Listing contents of large s3 folders is slow #140

Open yoel-ross-zip opened 2 years ago

yoel-ross-zip commented 2 years ago

Hey,

Thanks for your work on this library. I've been using it for a while and it's really nice.

Recently I ran into some issues with long load times for large S3 folders. I believe this is the result of repeated synchronous calls to the abstract lstat method. I did some testing and found that making these calls with asyncio, using the s3fs._info method instead, really speeds things up (roughly 20x faster on large folders).
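To give a rough idea of the approach (just a sketch, not the actual PR code; the helper name and the session handling here are illustrative), gathering the async s3fs._info calls concurrently looks something like this:

import asyncio

import s3fs

async def stat_keys(bucket, keys):
    # s3fs exposes async counterparts of its methods with a leading underscore
    fs = s3fs.S3FileSystem(asynchronous=True)
    session = await fs.set_session()
    try:
        # Fire all the metadata lookups concurrently instead of issuing
        # one blocking stat call per entry
        infos = await asyncio.gather(*(fs._info(f"{bucket}/{key}") for key in keys))
    finally:
        await session.close()
    return infos

# e.g. asyncio.run(stat_keys("my-bucket", ["notebooks/a.ipynb", "notebooks/b.ipynb"]))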

I'm currently using a fork I made with these changes, and it works great. I opened a PR for you to consider: https://github.com/danielfrg/s3contents/pull/139

I use this library quite a bit, and would be happy to put in the work to get this change merged.

Thanks again!

Joe

danielfrg commented 2 years ago

Fixed thanks to your PR :) Thanks!

aleny91 commented 2 years ago

@ziprjoe @danielfrg First of all, many thanks for your precious work! 😄 I've just installed this new version because I noticed the same problem when working with large directories. Sadly, I'm now facing an error: it seems that the .s3keep file is present in the bucket only at the top level, but not in the subdirectories where it is also searched for. Any suggestions?

[screenshot of the error attached]

yoel-ross-zip commented 2 years ago

Hey, this should be a matter of catching the exception and ignoring it. In cases where there is no .s3keep file, there isn't a way to show the last update time, so a dummy date will be displayed instead. This PR should fix it: https://github.com/danielfrg/s3contents/pull/143
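Conceptually the fix looks something like this (an illustrative sketch only, not the exact patch; the function name, the ST_MTIME key, and the dummy constant are placeholders based on the lstat abstraction mentioned above):

from datetime import datetime

# Placeholder timestamp used when there is no .s3keep to read a real date from
DUMMY_DATE = datetime(1970, 1, 1)

def safe_lstat(fs, path):
    try:
        return fs.lstat(path)
    except FileNotFoundError:
        # Missing .s3keep for this directory: ignore the error and fall back
        # to the dummy date instead of failing the whole listing
        return {"ST_MTIME": DUMMY_DATE}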

fakhavan commented 1 year ago

@ziprjoe @danielfrg Firstly, I'd like to express my gratitude for your excellent work on this library. It has been incredibly useful for my use-case of connecting s3 with Jhub compared to the alternatives.

However, I've encountered an issue when using s3contents to connect to an S3 bucket with pre-existing directories. These directories aren't displayed in the UI unless I manually add a .s3keep file to each directory. Once I do this, the issue is resolved. I'm wondering if you are aware of the cause of this problem and if there's a way to use s3contents with a bucket that has pre-existing directories without having to manually add .s3keep files to each directory.

Thank you for your time and attention!

danielfrg commented 1 year ago

Hi @ziprjoe.

I think there are newer ways to handle directories in S3 that do not require the placeholder files. I have not tested them, and to be honest I am not using this lib anymore.

I try to keep it updated, but since I am not using it, it is behind on needed features and I don't expect I will be able to add new features in the near future. I basically just handle new releases from contributors at this point.

fbaldo31 commented 6 months ago

> @ziprjoe @danielfrg Firstly, I'd like to express my gratitude for your excellent work on this library. It has been incredibly useful for my use-case of connecting s3 with Jhub compared to the alternatives.
>
> However, I've encountered an issue when using s3contents to connect to an S3 bucket with pre-existing directories. These directories aren't displayed in the UI unless I manually add a .s3keep file to each directory. Once I do this, the issue is resolved. I'm wondering if you are aware of the cause of this problem and if there's a way to use s3contents with a bucket that has pre-existing directories without having to manually add .s3keep files to each directory.
>
> Thank you for your time and attention!

I handle that with a script called from a postStart lifecycle hook:

file=$HOME/.dir.txt

# Save the S3 directory tree: strip the date/time/size columns from `aws s3 ls`,
# keep only the directory part of each key, and de-duplicate (sort -u is safer
# than uniq here because the same dirname is not guaranteed to appear on adjacent lines)
aws s3 ls --recursive s3://<bucket> | cut -c32- | xargs -d '\n' -n 1 dirname | sort -u > "$file"

# Create an empty local placeholder file
touch .s3keep

# Upload a .s3keep placeholder into every directory so it shows up in the UI
while IFS= read -r folder; do
    aws s3 cp .s3keep "s3://<bucket>/$folder/.s3keep"
done < "$file"