flow-php / flow

Flow PHP - data processing framework
https://flow-php.com
MIT License

Bypass schema inference when reading from Parquet files #975

Closed norberttech closed 7 months ago

norberttech commented 7 months ago

Change Log

Added

  • Converting parquet to flow schema

Fixed

  • Parquet JSON is now stored as BYTE_ARRAY with logical type JSON instead of STRING
  • Parquet UUID is now stored as BYTE_ARRAY with logical type UUID instead of STRING

Changed

Removed

Deprecated

Security


Description

As expected, bypassing schema inference improves the performance of reading from Parquet files. Results on the sample orders dataset:

Before:

image

After:

image
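
To illustrate the idea behind the change, here is a minimal sketch in Python, not Flow's actual implementation. It assumes a hypothetical lookup table (`PARQUET_TO_SCHEMA`) and hypothetical function names. It contrasts building a schema from the column types declared in the file metadata with guessing types from row values.

```python
# Hypothetical lookup from Parquet (physical type, logical type) to a schema type.
# The names are illustrative only and are not Flow PHP's real classes or API.
PARQUET_TO_SCHEMA = {
    ("BYTE_ARRAY", "STRING"): "string",
    ("BYTE_ARRAY", "JSON"): "json",
    ("INT64", None): "integer",
    ("DOUBLE", None): "float",
    ("BOOLEAN", None): "boolean",
}


def schema_from_metadata(columns):
    """Build a schema from declared column types; no row values are read."""
    schema = {}
    for name, physical, logical in columns:
        key = (physical, logical)
        if key not in PARQUET_TO_SCHEMA:
            raise ValueError(f"unsupported column type for {name!r}: {key}")
        schema[name] = PARQUET_TO_SCHEMA[key]
    return schema


def infer_schema_from_rows(rows):
    """Baseline for comparison: guess each column's type from its values."""
    schema = {}
    for row in rows:  # work grows with rows x columns
        for name, value in row.items():
            schema.setdefault(name, type(value).__name__)
    return schema


columns = [("id", "INT64", None), ("payload", "BYTE_ARRAY", "JSON")]
rows = [{"id": 1, "payload": '{"status": "new"}'}]

print(schema_from_metadata(columns))
print(infer_schema_from_rows(rows))
```

In this sketch, value-based guessing sees the JSON payload as a plain string unless it also parses every value, whereas the declared logical type carries that information directly and costs one lookup per column.
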
github-actions[bot] commented 7 months ago

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors

```shell
+-----------------------+-------------------+------+-----+-------------------+-------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak          | mode              | rstdev          |
+-----------------------+-------------------+------+-----+-------------------+-------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.271mb +0.00%   | 826.162ms +0.56%  | ±0.58% -81.29%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.991mb +0.03%    | 339.574ms -0.69%  | ±0.30% -72.16%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 5.141mb +0.03%    | 1.051s -0.11%     | ±0.66% +156.42% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 239.845mb -43.38% | 891.002ms -29.02% | ±0.38% -63.91%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.905mb +0.03%    | 34.861ms -1.19%   | ±0.40% -49.93%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.906mb +0.03%    | 429.136ms -1.46%  | ±0.29% -90.92%  |
+-----------------------+-------------------+------+-----+-------------------+-------------------+-----------------+
```

Transformers

```shell
+-----------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev          |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 110.611mb +0.00% | 66.152ms +2.29% | ±1.69% +109.35% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
```

Loaders

```shell
+--------------------+----------------+------+-----+-------------------+------------------+------------------+
| benchmark          | subject        | revs | its | mem_peak          | mode             | rstdev           |
+--------------------+----------------+------+-----+-------------------+------------------+------------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 95.652mb +0.00%   | 456.701ms -3.08% | ±0.64% +72.71%   |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.127mb +0.00%   | 71.835ms +0.39%  | ±0.77% -16.86%   |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 106.556mb +0.00%  | 54.748ms +5.46%  | ±3.61% +1108.56% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 321.767mb -30.27% | 1.424s -2.75%    | ±1.01% -5.20%    |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.950mb +0.01%   | 41.034ms +7.06%  | ±0.26% +68.32%   |
+--------------------+----------------+------+-----+-------------------+------------------+------------------+
```

Building Blocks

```shell
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 76.685mb +0.00%  | 3.627ms +8.00%   | ±2.41% +173.43% |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 96.412mb +0.00%  | 183.249ms +0.37% | ±0.77% -43.55%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 74.938mb +0.00%  | 18.171ms +0.05%  | ±1.14% +44.00%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 77.925mb +0.00%  | 1.665ms +0.98%   | ±0.06% -94.24%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 77.925mb +0.00%  | 1.693ms +3.43%   | ±0.78% -65.85%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 75.038mb +0.00%  | 2.484ms +2.76%   | ±0.79% +32.04%  |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 75.567mb +0.00%  | 16.419ms +12.50% | ±0.86% -26.78%  |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 75.567mb +0.00%  | 17.074ms +18.43% | ±2.03% +66.84%  |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 73.471mb +0.00%  | 1.706μs +0.36%   | ±2.72% +0.00%   |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 73.471mb +0.00%  | 0.400μs +33.33%  | ±0.00% +0.00%   |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 87.025mb +0.00%  | 12.859ms +0.77%  | ±0.89% +44.88%  |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 116.386mb +0.00% | 63.664ms +0.18%  | ±0.30% -40.05%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 76.086mb +0.00%  | 1.212ms +2.64%   | ±0.41% -75.28%  |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 79.433mb +0.00%  | 58.023ms +2.29%  | ±1.44% +216.34% |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 78.188mb +0.00%  | 3.845ms +0.27%   | ±2.72% -9.29%   |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 73.549mb +0.00%  | 39.747ms +0.08%  | ±0.70% +392.67% |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 73.549mb +0.00%  | 40.710ms +2.29%  | ±0.97% +24.66%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 73.549mb +0.00%  | 41.383ms +2.10%  | ±1.45% -47.65%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 75.912mb +0.00%  | 7.501ms +3.26%   | ±1.39% +252.51% |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 73.471mb +0.00%  | 29.039ms -0.14%  | ±0.94% +179.57% |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 73.471mb +0.00%  | 13.779μs +2.67%  | ±0.90% -3.37%   |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 73.471mb +0.00%  | 15.879μs +1.28%  | ±0.78% -1.26%   |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 96.479mb +0.00%  | 189.341ms +0.02% | ±1.07% +35.52%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.717mb +0.00% | 505.029ms +3.72% | ±1.46% +36.26%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 60.195mb +0.00%  | 250.393ms +2.49% | ±0.76% -30.09%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 15.129mb +0.01%  | 52.401ms +1.05%  | ±1.64% +42.97%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.963mb +0.00%  | 437.178ms +2.47% | ±0.55% +50.47%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.503mb +0.01%  | 88.006ms +3.63%  | ±1.24% +25.67%  |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
```

norberttech commented 7 months ago

The performance improvements are also confirmed by PHPBench 😁

| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 239.845mb -43.38% | 891.002ms -29.02% | ±0.38% -63.91% |
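
For readers who want absolute numbers, the row above can be turned back into approximate 1.x baseline values. The sketch below assumes that the percentage next to each value is the relative change against the baseline, i.e. `(current - baseline) / baseline * 100`. The helper name `baseline_from_diff` is made up for this example. Under that assumption the baseline works out to roughly 1.26 s and 424 MB.

```python
def baseline_from_diff(current, percent_diff):
    """Recover the baseline value from a current value and its relative change.

    Assumes percent_diff = (current - baseline) / baseline * 100.
    """
    factor = 1 + percent_diff / 100
    if factor == 0:
        raise ValueError("a -100% change has no recoverable baseline")
    return current / factor


# Values taken from the ParquetExtractorBench row above.
mode_ms = baseline_from_diff(891.002, -29.02)
mem_mb = baseline_from_diff(239.845, -43.38)

print(f"baseline mode: about {mode_ms:.0f} ms")
print(f"baseline mem_peak: about {mem_mb:.0f} MB")
```
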