Closed delucchi-cmu closed 1 month ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.79%. Comparing base (04c1a8f) to head (c45a6dc).
Change Description
Closes #308.
Solution Description
Creates a new kind of file reader for catalog import: indexed file reader. This uses a single "index" file as a task unit, and these files contain only paths to data files to be read. This enables batching many small input data files into larger chunks for the map and reduce stages of the pipeline.
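To illustrate the idea, here is a minimal sketch of an indexed reader using only the Python standard library. It is not the code from this PR: the index format (a plain-text file with one data file path per line), the function names `read_index_file` and `read_indexed_csv`, and the `batch_size` parameter are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def read_index_file(index_path):
    """Return the data file paths listed in an index file.

    Assumed format (hypothetical): one path per line, blank lines ignored.
    """
    lines = Path(index_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def read_indexed_csv(index_path, batch_size=2):
    """Yield lists of rows, each combining up to `batch_size` small CSV files."""
    paths = read_index_file(index_path)
    for start in range(0, len(paths), batch_size):
        rows = []
        for data_path in paths[start:start + batch_size]:
            with open(data_path, newline="") as handle:
                rows.extend(csv.DictReader(handle))
        yield rows


def demo():
    """Write three one-row CSV files plus an index file, then read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        data_paths = []
        for i in range(3):
            path = tmp / f"part_{i}.csv"
            path.write_text(f"id,ra\n{i},{i * 10.0}\n")
            data_paths.append(str(path))
        index_path = tmp / "index.txt"
        index_path.write_text("\n".join(data_paths) + "\n")
        # Consume the generator while the temporary files still exist.
        return [len(chunk) for chunk in read_indexed_csv(index_path, batch_size=2)]
```

In this sketch the index file is the unit of work: a single task opens it, then reads the listed data files in batches, so three small files become two chunks rather than three separate tasks.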
Implements indexed readers for CSV and Parquet files. In particular, the Parquet reader uses pyarrow's `batch_readahead` and `fragment_readahead` read options, along with multi-threading, to further speed up reads of many small data files.
Code Quality