MAAP-Project / Community

Issue for MAAP (Zenhub)

Searching DPS output is slow #364

Open wildintellect opened 3 years ago

wildintellect commented 3 years ago

Is your feature request related to a problem? Please describe. When a user outputs thousands of DPS runs, the dps_output folder is deeply nested, and searching through all of the runs to build a list of results takes a very long time. Building this list happens very often: for uploading granules to a user share, for building mosaics, for collating results in a reduce step, and for building a list of outputs to use as inputs to the next DPS algorithm.

Describe the solution you'd like A way to speed up the search and collation of results. The solution should probably live in the maap-py library so that it is easy for users to call, unless the code is so simple that a page in the documentation is sufficient. Of the alternatives listed below, I'm suggesting (4), a parallel solution using boto3. Happy to discuss other options.

Describe alternatives you've considered

  1. I investigated whether a recursive glob could be run in parallel. Technically it can, but the backend is s3fs, which appears to be single-threaded (or at least single-CPU), so it doesn't parallelize the listdir operation against S3, which is known to be on the slow side.
  2. DPS could catalog the outputs in a more efficient search system when returning the jobs. SQLite? DynamoDB?
  3. All outputs could be pooled at a higher level, reducing the number of directories that must be traversed and listed with listdir.
  4. Implement a parallel solution using boto3 that uses threads/CPUs to create the subdirectory lists in parallel.
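A rough sketch of what alternative (4) could look like, not a tested maap-py implementation. It makes one delimited listing to discover the immediate sub-prefixes, then lists each sub-prefix in its own thread with a flat (non-delimited) listing, which returns keys at every depth without a per-directory listdir. The function names, the `suffix` filter, and `max_workers` are illustrative assumptions. The bucket name and key prefix that correspond to a given dps_output path are not stated in this issue, so they are left as parameters.

```python
from concurrent.futures import ThreadPoolExecutor


def list_keys(client, bucket, prefix, suffix=""):
    """Flat listing: every key under `prefix`, at any depth, ending in `suffix`."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(
            obj["Key"] for obj in page.get("Contents", [])
            if obj["Key"].endswith(suffix)
        )
    return keys


def list_keys_parallel(client, bucket, prefix, suffix="", max_workers=16):
    """List keys under `prefix`, fanning out one thread per immediate sub-prefix.

    `client` is expected to behave like boto3.client("s3"); it is passed in
    rather than created here so the caller controls credentials and region.
    """
    paginator = client.get_paginator("list_objects_v2")
    keys, subprefixes = [], []
    # Delimiter="/" makes S3 group deeper keys into CommonPrefixes, so this
    # pass returns only the first level below `prefix`.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        subprefixes.extend(p["Prefix"] for p in page.get("CommonPrefixes", []))
        keys.extend(
            obj["Key"] for obj in page.get("Contents", [])
            if obj["Key"].endswith(suffix)
        )
    # The listing calls are network-bound, so threads are enough here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(
            lambda p: list_keys(client, bucket, p, suffix), subprefixes
        ):
            keys.extend(chunk)
    return keys


# Hypothetical usage (bucket and prefix are placeholders, not real values):
#   import boto3
#   s3 = boto3.client("s3")
#   h5_keys = list_keys_parallel(s3, "<bucket>", "<prefix>/2021/06/16/17/", ".h5")
```

The speedup depends on how evenly the outputs are spread across the first level of sub-prefixes; if nearly everything sits under one sub-prefix, the fan-out would need to happen one level deeper.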

Additional context S3 is known to be a little slow at directory listing, and adding s3fs into the mix makes it worse. Example:

import glob
test_files = glob.glob("/projects/r2d2/dps_output/run_rebinning_ubuntu/master/2021/06/16/17/**/**/*.h5", recursive=True)

cc: @lauraduncanson @pahbs

gchang commented 3 months ago

Old ticket; maybe STAC will solve this.

rtapella commented 3 months ago

I think number 3 (merge the subfolders so that there's less nesting of the output folders) should also happen.