danielfrg / s3contents

Jupyter Notebooks in S3 - Jupyter Contents Manager implementation
Apache License 2.0

Fix issue https://github.com/danielfrg/s3contents/issues/98 #101

Closed mpermana closed 3 years ago

mpermana commented 4 years ago

Performance is very bad, to the point of being unusable, when an S3 folder has many entries, say 1000 files.

When listing S3 objects, we already have most of the information needed to build a directory model, so we can build the model directly from that listing.
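
A minimal sketch of the idea, using s3fs's `ls(detail=True)` (the exact s3contents internals differ; the model fields here are illustrative):

```python
import s3fs

fs = s3fs.S3FileSystem()

def build_directory_model(prefix):
    """Build a directory model from a single LIST request.

    `ls(detail=True)` already returns name, size, type, and
    last-modified metadata for every entry, so no per-file
    HEAD/stat calls are needed.
    """
    content = []
    for entry in fs.ls(prefix, detail=True):  # one S3 LIST call
        content.append({
            "name": entry["name"].rsplit("/", 1)[-1],
            "path": entry["name"],
            "type": "directory" if entry["type"] == "directory" else "file",
            "size": entry.get("size"),
            "last_modified": entry.get("LastModified"),
        })
    return {"type": "directory", "content": content}
```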

mpermana commented 4 years ago

@danielfrg Looks like this is only a lint error? I can't figure out where the lint error is coming from.

danielfrg commented 4 years ago

Ah yeah, this is annoying. I think it's because of the Python version or something.

Could you comment out that step in the GitHub Actions YAML?

ericdill commented 4 years ago

This is going to require updating s3fs to 0.5.0 or newer (that's when the loop attribute was added).

A couple of questions. You don't necessarily have to know the answers here, but I figured I'd ask before trying to replicate this.

  1. Is this reproducible with GCP? Minio?

If it's at least reproducible with Minio, we can potentially guard against regressions in the future without needing manual QA (a possible local Minio setup is sketched after this list).

  2. Any thoughts on how to test for improved performance on S3 / Minio / GCS?

edit: If you rebase onto current master branch, that linting error is fixed (well, disabled)
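
On the Minio question above, a possible local setup for reproducing the issue: a hedged sketch, assuming a MinIO server is already running on localhost:9000 with its default credentials (this mirrors the endpoint-override pattern from the s3contents README; the bucket name is illustrative):

```python
# jupyter_notebook_config.py -- point s3contents at a local MinIO server.
# Assumes MinIO is already running on localhost:9000 with its default
# credentials; the bucket name is illustrative.
from s3contents import S3ContentsManager

c = get_config()  # provided by Jupyter when it loads this config file
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.access_key_id = "minioadmin"
c.S3ContentsManager.secret_access_key = "minioadmin"
c.S3ContentsManager.endpoint_url = "http://localhost:9000"
c.S3ContentsManager.bucket = "notebooks"
```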

ericdill commented 4 years ago

Thanks for the PR, btw!

mpermana commented 4 years ago
  1. Yes, it's reproducible.
  2. The test can count calls to fstat: the original code calls s3fs's fstat far more times than necessary. A test case would create 1000 files in a directory and count the calls to fstat before and after the code change; the count should be much lower afterwards. A sketch of what that could look like is below.
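
A rough sketch of such a test (the `contents_manager` fixture is hypothetical; in s3fs the per-file metadata call is `S3FileSystem.info`):

```python
from unittest import mock

import s3fs

def test_listing_avoids_per_file_stat(contents_manager):
    """Hypothetical regression test: listing a big directory should not
    issue one metadata (stat/info) request per file."""
    # Create many small files; `contents_manager` is assumed to be a
    # pytest fixture yielding a configured S3ContentsManager.
    for i in range(1000):
        contents_manager.save(
            {"type": "file", "format": "text", "content": ""},
            path=f"bigdir/file-{i}.txt",
        )

    calls = []
    original_info = s3fs.S3FileSystem.info

    def counting_info(self, path, **kwargs):
        calls.append(path)
        return original_info(self, path, **kwargs)

    with mock.patch.object(s3fs.S3FileSystem, "info", counting_info):
        contents_manager.get("bigdir", content=True)

    # Before the fix this was roughly one call per file; after it
    # should be a small constant.
    assert len(calls) < 10
```
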
mpermana commented 4 years ago

Can't figure out why tests are still failing.

mpermana commented 4 years ago

@danielfrg Any idea how I can get the tests to pass? They're still failing.

pmoranga commented 3 years ago

We also see this issue from time to time.

About the questions from @ericdill: the issue seems to be that we are making too many calls to the AWS S3 API. This could easily be measured by profiling the Python calls when using s3fs; maybe a cache layer for the metadata could save us here.

Can you add a regression test that generates a report with cProfile showing the number of calls made through s3fs?
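
A minimal sketch of that kind of measurement (the manager and directory path are placeholders):

```python
import cProfile
import pstats

# `contents_manager` is assumed to be an already-configured
# S3ContentsManager; the directory path is a placeholder.
profiler = cProfile.Profile()
profiler.enable()
contents_manager.get("some/big/dir", content=True)
profiler.disable()

# Restrict the report to frames coming from s3fs so the call counts
# against the S3 API stand out.
pstats.Stats(profiler).sort_stats("ncalls").print_stats("s3fs")
```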

reallyrandom commented 3 years ago

Any chance this gets merged to master? I am on S3, and while I only have ~30 files, I am still waiting approximately 7 seconds for TTFB on each directory change in the file manager.

danielfrg commented 3 years ago

I know this took forever, but I finally merged these changes in https://github.com/danielfrg/s3contents/pull/121.

I will be making a release soon.

Thank you very much for your contribution.