astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Creation of many small files before merging #275

Open nevencaplar opened 3 months ago

nevencaplar commented 3 months ago

When creating a new catalog, we shard each input file to conform to the output catalog, then merge all of the shards that belong to the same HEALPix pixel. This produces many small intermediate files, and writing them slows down the import. Explore a better way to do it (dask.shuffle?) that avoids writing so many small files.
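To make the scale of the problem concrete, here is a minimal sketch of how the number of intermediate files grows under a shard-then-merge scheme. It is not the hipscat-import code: `count_shards` and `pixel_of` are hypothetical names, and the "pixel" assignment is a toy stand-in for a real HEALPix lookup. The only assumption is that each reader chunk writes one shard per destination pixel it touches.

```python
def count_shards(chunks, pixel_of):
    """Return (shard_count, pixel_count) for a shard-then-merge import.

    chunks: list of lists of rows; each inner list is one reader chunk.
    pixel_of: hypothetical function mapping a row to its destination pixel.
    """
    shards = set()   # one intermediate file per (chunk, pixel) pair
    pixels = set()   # one final file per destination pixel
    for chunk_index, rows in enumerate(chunks):
        for row in rows:
            pixel = pixel_of(row)
            shards.add((chunk_index, pixel))
            pixels.add(pixel)
    return len(shards), len(pixels)


# Toy data: 4 chunks of integer "rows"; the toy pixel is row % 3.
chunks = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
shard_count, pixel_count = count_shards(chunks, lambda row: row % 3)
print(shard_count, pixel_count)  # 12 3
```

In this toy case, 12 intermediate files are written to produce 3 output files. In general the shard count can approach (number of chunks) x (number of pixels), which is the overhead this issue is about.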

troyraen commented 1 month ago

One thing I want to try on my next import is to consolidate the shards (per pixel) generated from a single input file before returning from the split_pixels function. This should help for large input files that get split into many chunks by the reader. For small input files, I'll try #308, and then this consolidation should help with those as well.

So the same number of intermediate files would be written initially, but they'd immediately be reduced so that the next steps can deal with a smaller number of files. This should help not only with the final "reducing" step, but also a) make it easier to verify the intermediate dataset, which I'm planning for #118; and b) if/when something goes wrong with the import, make it easier to figure out what's actually on disk and then resume either splitting or reducing instead of starting over.
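A rough sketch of the per-pixel consolidation idea, under stated assumptions: the file naming scheme (`chunk_<i>_pixel_<p>.csv`), the function name `consolidate_shards`, and the use of CSV are all hypothetical and chosen only to keep the example dependency-free. The actual intermediate files and the internals of split_pixels may look quite different.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def consolidate_shards(shard_dir):
    """Merge all shards for the same pixel into one file per pixel.

    Assumes hypothetical names like 'chunk_<i>_pixel_<p>.csv', each with
    the same header row. Returns {pixel: consolidated_path}.
    """
    shard_dir = Path(shard_dir)
    by_pixel = defaultdict(list)
    for path in sorted(shard_dir.glob("chunk_*_pixel_*.csv")):
        pixel = path.stem.split("_pixel_")[1]
        by_pixel[pixel].append(path)

    merged = {}
    for pixel, paths in by_pixel.items():
        out_path = shard_dir / f"pixel_{pixel}.csv"
        with out_path.open("w", newline="") as out_file:
            writer = csv.writer(out_file)
            for index, path in enumerate(paths):
                with path.open(newline="") as in_file:
                    reader = csv.reader(in_file)
                    header = next(reader)
                    if index == 0:
                        writer.writerow(header)  # keep a single header row
                    writer.writerows(reader)
        for path in paths:
            path.unlink()  # delete shards only after the merged file exists
        merged[pixel] = out_path
    return merged


# Usage: three shards from two chunks collapse into two per-pixel files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for chunk, pixel, ra in [(0, 7, 1.0), (1, 7, 2.0), (0, 9, 3.0)]:
        shard = tmp_path / f"chunk_{chunk}_pixel_{pixel}.csv"
        with shard.open("w", newline="") as f:
            csv.writer(f).writerows([["ra"], [ra]])
    merged = consolidate_shards(tmp_path)
    remaining = sorted(p.name for p in tmp_path.iterdir())
    pixel_7_rows = merged["7"].read_text().split()

print(remaining)     # ['pixel_7.csv', 'pixel_9.csv']
print(pixel_7_rows)  # ['ra', '1.0', '2.0']
```

Deleting the shards only after the consolidated file is fully written keeps the directory in one of two recognizable states per pixel, which fits the goal of being able to inspect what is on disk and resume after a failure.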