flow-php / flow

Flow PHP - data processing framework
https://flow-php.com
MIT License

Parquet - Change how the dictionary page header is detected while reading column chunks #1005

Closed norberttech closed 6 months ago

norberttech commented 6 months ago

Change Log

Added

Fixed

  • Change how the dictionary page header is detected while reading column chunks

Changed

Removed

Deprecated

Security


Description

This PR should resolve the issue mentioned here: https://github.com/flow-php/flow/issues/984#issuecomment-1949891775

It turned out that some files generated by Spark do not set the metadata properly. Because of that, the dictionary page header is not recognized correctly, and without it the whole column can't be read. This approach reads the first page header of the column chunk and, if it is a dictionary page header, uses it to read the column dictionary.
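The detection idea described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code from this PR: `Page` and `read_column_chunk` are hypothetical names, pages are assumed to be already decoded into Python lists, and every data page is assumed to hold dictionary indices whenever a dictionary is present. Only the `PageType` values are taken from the Parquet format definition.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class PageType(IntEnum):
    # Page type values as defined in the Parquet format (parquet.thrift).
    DATA_PAGE = 0
    INDEX_PAGE = 1
    DICTIONARY_PAGE = 2
    DATA_PAGE_V2 = 3


@dataclass
class Page:
    # Hypothetical, already-decoded page: the header type plus its payload.
    type: PageType
    values: List[object]


def read_column_chunk(pages: List[Page]) -> List[object]:
    """Decide whether a dictionary exists by looking at the first page header,
    not at the column chunk metadata (which may be missing or wrong)."""
    if not pages:
        return []

    dictionary = None
    data_pages = pages
    if pages[0].type == PageType.DICTIONARY_PAGE:
        # The first header is a dictionary header: use it as the column dictionary.
        dictionary = pages[0].values
        data_pages = pages[1:]

    result: List[object] = []
    for page in data_pages:
        if page.type not in (PageType.DATA_PAGE, PageType.DATA_PAGE_V2):
            continue  # skip anything that is not a data page
        if dictionary is not None:
            # Simplification: treat every data page value as a dictionary index.
            result.extend(dictionary[index] for index in page.values)
        else:
            result.extend(page.values)
    return result


chunk = [
    Page(PageType.DICTIONARY_PAGE, ["a", "b"]),
    Page(PageType.DATA_PAGE, [0, 1, 1, 0]),
]
print(read_column_chunk(chunk))
```

The point of the sketch is that the decision depends only on what is actually stored at the start of the column chunk, so a file whose metadata does not describe the dictionary page can still be decoded.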

github-actions[bot] commented 6 months ago

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors

```shell
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.279mb +0.00%  | 818.370ms -1.56% | ±0.90% +88.60%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 5.002mb +0.01%   | 343.164ms +0.99% | ±1.27% +496.01% |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 5.152mb +0.01%   | 1.047s -0.53%    | ±0.89% +161.64% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 135.818mb +0.00% | 907.455ms -0.30% | ±0.86% -57.95%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.910mb +0.01%   | 35.813ms +2.07%  | ±0.33% -74.22%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.915mb +0.01%   | 430.578ms -0.87% | ±0.38% -82.52%  |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
```
Transformers

```shell
+-----------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev          |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 110.616mb +0.00% | 64.049ms -1.07% | ±1.00% +618.19% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
```
Loaders

```shell
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 95.659mb +0.00%  | 474.671ms +1.97% | ±0.40% -67.12%  |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.141mb +0.00%  | 72.022ms +0.31%  | ±1.23% +393.13% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 106.567mb +0.00% | 53.243ms +0.79%  | ±0.68% -53.43%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 224.385mb +0.00% | 1.418s -1.13%    | ±0.44% -1.85%   |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.957mb +0.00%  | 39.541ms -1.35%  | ±0.15% +140.56% |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
```
Building Blocks

```shell
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev           |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.958mb +0.00%  | 438.047ms +0.87% | ±0.39% -28.70%   |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.497mb +0.00%  | 87.822ms +0.41%  | ±0.69% +16.44%   |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.714mb +0.00% | 494.389ms +1.02% | ±0.62% -22.72%   |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 60.192mb +0.00%  | 251.271ms +2.73% | ±0.66% -17.51%   |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 15.127mb +0.00%  | 52.971ms +0.67%  | ±1.29% +523.90%  |
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 76.682mb +0.00%  | 3.575ms -1.41%   | ±2.03% -32.19%   |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 96.409mb +0.00%  | 179.097ms -1.07% | ±0.46% -44.65%   |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 74.935mb +0.00%  | 18.489ms +1.71%  | ±1.00% +30.03%   |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 77.922mb +0.00%  | 2.160ms +22.22%  | ±3.32% +99.85%   |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 77.922mb +0.00%  | 2.231ms +28.85%  | ±2.51% -1.98%    |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 75.035mb +0.00%  | 2.882ms +11.10%  | ±2.57% +16.27%   |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 75.563mb +0.00%  | 15.233ms +3.03%  | ±1.27% +8.83%    |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 75.563mb +0.00%  | 14.510ms -4.41%  | ±2.42% +70.96%   |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 73.468mb +0.00%  | 2.000μs +4.93%   | ±0.00% -100.00%  |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 73.468mb +0.00%  | 0.400μs 0.00%    | ±0.00% 0.00%     |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 87.022mb +0.00%  | 13.747ms +3.13%  | ±3.05% +299.54%  |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 116.383mb +0.00% | 66.137ms -4.40%  | ±0.30% -68.37%   |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 76.083mb +0.00%  | 1.362ms -6.63%   | ±3.49% +2160.02% |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 79.430mb +0.00%  | 57.351ms -7.48%  | ±0.61% +137.64%  |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 78.185mb +0.00%  | 3.872ms -6.79%   | ±0.60% -69.49%   |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 73.546mb +0.00%  | 39.761ms -1.57%  | ±0.72% -73.02%   |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 73.546mb +0.00%  | 41.990ms +0.55%  | ±3.17% +38.35%   |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 73.546mb +0.00%  | 39.959ms -0.87%  | ±1.09% -8.54%    |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 75.909mb +0.00%  | 7.348ms -0.01%   | ±0.43% -51.19%   |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 73.468mb +0.00%  | 29.552ms +2.15%  | ±0.77% +261.89%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 73.468mb +0.00%  | 13.707μs +0.78%  | ±2.28% +90.23%   |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 73.468mb +0.00%  | 16.010μs +0.57%  | ±2.76% +212.77%  |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 96.476mb +0.00%  | 184.816ms -1.04% | ±1.33% +57.80%   |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
```