astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Enable "index file" reads for catalog import #334

Closed delucchi-cmu closed 1 month ago

delucchi-cmu commented 1 month ago

Change Description

Closes #308.

Solution Description

Adds a new kind of file reader for catalog import: the indexed file reader. Each task unit is a single "index" file, which contains only the paths of the data files to be read. This lets the pipeline batch many small input data files into larger chunks for the map and reduce stages.
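For illustration only, the index-file idea can be sketched with the standard library. This is not the code from this PR: the function names `read_index_file` and `read_indexed_csv` are hypothetical, and the one-path-per-line index layout is an assumption (the description only says the index files contain paths).

```python
import csv
from pathlib import Path


def read_index_file(index_path):
    """Return the data-file paths listed in an index file.

    Assumes one path per line; blank lines are skipped.
    """
    lines = Path(index_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def read_indexed_csv(index_path):
    """Yield every row (as a dict) from all CSV files named in the index.

    One index file is one task unit, so many small CSV files are
    consumed together as a single larger chunk of rows.
    """
    for data_path in read_index_file(index_path):
        with open(data_path, newline="", encoding="utf-8") as data_file:
            yield from csv.DictReader(data_file)
```

With this shape, the number of tasks is set by the number of index files rather than by the number of small data files.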

Implements indexed readers for CSV and Parquet files. In particular, the Parquet reader uses pyarrow's batch_readahead and fragment_readahead options, together with multi-threaded reads, to further speed up reading many small data files.

Code Quality

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.79%. Comparing base (04c1a8f) to head (c45a6dc).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #334   +/-   ##
=======================================
  Coverage   99.78%   99.79%
=======================================
  Files          26       26
  Lines        1389     1442   +53
=======================================
+ Hits         1386     1439   +53
  Misses          3        3
```
