astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Implement file-list-batch style catalog import #308

Closed delucchi-cmu closed 1 month ago

delucchi-cmu commented 2 months ago

Feature request

PLACEHOLDER.

There are a lot of details I'm glossing over. I'll write up more later.


troyraen commented 1 month ago

If I'm interpreting the title correctly, I think the feature request is:

Add docs and code to the file readers module showing how to pass lists of input files to a reader, which would concatenate data from multiple files as needed to yield chunks with at least x rows. This should reduce the number of files in the intermediate dataset in cases where the input files are small and numerous.
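A minimal sketch of what such a reader might look like. This is hypothetical: the actual reader interface in hipscat-import may differ, and the `load` and `min_chunk_rows` names are illustrative, not part of the existing API.

```python
import pandas as pd


def read_file_lists(file_lists, min_chunk_rows=100_000, load=pd.read_parquet):
    """Yield DataFrames of at least ``min_chunk_rows`` rows by
    concatenating small input files within each list.

    ``load`` is any callable mapping a file path to a DataFrame
    (e.g. ``pd.read_parquet`` or ``pd.read_csv``).
    """
    for file_list in file_lists:
        frames, rows = [], 0
        for path in file_list:
            frame = load(path)
            frames.append(frame)
            rows += len(frame)
            if rows >= min_chunk_rows:
                # Enough rows accumulated: emit one chunk and reset.
                yield pd.concat(frames, ignore_index=True)
                frames, rows = [], 0
        if frames:
            # Flush the remainder so each list's data stays within
            # its own chunks (keeps resume bookkeeping per-list).
            yield pd.concat(frames, ignore_index=True)
```

With many tiny input files, each yielded chunk then maps to one intermediate file instead of one intermediate file per input file.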

For reference, a recent import of the ZTF lightcurves produced an intermediate dataset with 4.4 million files. The import took several days to run, and multiple things went wrong at different stages, including obscure but crucial problems with the compute nodes. The sheer number of files made it practically impossible to verify what was actually on disk at any given time, especially after some of the intermediate files were deleted during the reducing step; I ended up having to start over completely.

troyraen commented 1 month ago

As I recall, @delucchi-cmu recommended sizing the lists so that there are 50-100 lists per worker. A single list of input files per worker is not recommended, because it prevents the pipeline from skipping previously completed input files when resuming the splitting step.
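The sizing rule above could be sketched as follows. This is a hypothetical helper, not part of hipscat-import; `lists_per_worker` is an illustrative parameter reflecting the 50-100 recommendation.

```python
def make_file_lists(files, n_workers, lists_per_worker=50):
    """Split ``files`` into roughly ``n_workers * lists_per_worker``
    batches of near-equal size.

    Many small batches (rather than one per worker) let the pipeline
    skip already-completed batches when the splitting step resumes.
    """
    n_lists = min(len(files), n_workers * lists_per_worker)
    batch_size = -(-len(files) // n_lists)  # ceiling division
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
```

For example, 4.4 million input files across 20 workers with 50 lists per worker would give 1000 batches of about 4400 files each.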