flow-php / flow

Flow PHP - data processing framework
https://flow-php.com
MIT License
404 stars 23 forks source link

Fixed replacing instance of dataframe when using aggregate method #1057

Closed norberttech closed 2 months ago

norberttech commented 2 months ago

Change Log

Added

Fixed

  • DataFrame::aggregate returns the same instance of dataframe with updated pipeline

Changed

Removed

Deprecated

Security


Description

As much as I hate using reflections I even more hate the idea of exposing Pipelines outside the dataframe. Pipelines are considered to be internal and should not be exposed as it's not something we want to provide BC promise.

I couldn't find a better compromise between keeping the data frame mutable (it's just a big builder for the pipeline) and allowing it to expose partial builders like in this case GroupBy. Since we are just building a pipeline, returning a new instance of its builder is highly problematic since it's already a different object so it can't be referenced anymore.

<?php

$df = df();
$df->read(...);
$df->withEntry(...);
$df->aggregate(...)
// without reflection this would not work anymore
$df->rename(...)
$rows = $df->fetch();

The goal of GroupedDataFrame is to not let users do something like for example DataFrame::groupBy()->run(). That would be highly problematic since it would be a source of memory leaks as grouping without aggregation is just collecting all rows. InnerBuilders for more advanced transformations are also improving DX since IDE will guide developers through all possible methods.

github-actions[bot] commented 2 months ago

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors ```shell +-----------------------+-------------------+------+-----+------------------+------------------+-----------------+ | benchmark | subject | revs | its | mem_peak | mode | rstdev | +-----------------------+-------------------+------+-----+------------------+------------------+-----------------+ | AvroExtractorBench | bench_extract_10k | 1 | 3 | 35.290mb +0.00% | 847.026ms -0.53% | ±0.53% +29.39% | | CSVExtractorBench | bench_extract_10k | 1 | 3 | 5.010mb +0.01% | 344.675ms -0.68% | ±0.63% +23.47% | | JsonExtractorBench | bench_extract_10k | 1 | 3 | 5.165mb +0.00% | 1.077s -1.59% | ±0.13% -87.59% | | ParquetExtractorBench | bench_extract_10k | 1 | 3 | 135.837mb +0.00% | 911.921ms -0.96% | ±0.46% -22.65% | | TextExtractorBench | bench_extract_10k | 1 | 3 | 4.925mb +0.01% | 35.490ms -1.39% | ±2.58% +785.80% | | XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.931mb +0.01% | 433.845ms -1.70% | ±0.87% +16.94% | +-----------------------+-------------------+------+-----+------------------+------------------+-----------------+ ```
Transformers ```shell +-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+ | benchmark | subject | revs | its | mem_peak | mode | rstdev | +-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+ | RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 116.231mb +0.00% | 59.593ms -1.70% | ±0.58% -41.36% | +-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+ ```
Loaders ```shell +--------------------+----------------+------+-----+------------------+------------------+-----------------+ | benchmark | subject | revs | its | mem_peak | mode | rstdev | +--------------------+----------------+------+-----+------------------+------------------+-----------------+ | AvroLoaderBench | bench_load_10k | 1 | 3 | 96.676mb +0.00% | 452.794ms -1.90% | ±0.71% +36.10% | | CSVLoaderBench | bench_load_10k | 1 | 3 | 55.151mb +0.00% | 67.738ms -2.68% | ±1.74% +34.47% | | JsonLoaderBench | bench_load_10k | 1 | 3 | 107.585mb +0.00% | 50.570ms -3.24% | ±0.46% -20.00% | | ParquetLoaderBench | bench_load_10k | 1 | 3 | 227.005mb +0.00% | 1.412s -1.37% | ±0.78% +222.28% | | TextLoaderBench | bench_load_10k | 1 | 3 | 17.967mb +0.00% | 37.618ms -4.13% | ±0.23% -80.19% | +--------------------+----------------+------+-----+------------------+------------------+-----------------+ ```
Building Blocks ```shell +-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+ | benchmark | subject | revs | its | mem_peak | mode | rstdev | +-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+ | RowsBench | bench_chunk_10_on_10k | 2 | 3 | 87.051mb +0.00% | 3.309ms -12.97% | ±1.74% +126.82% | | RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 102.649mb +0.00% | 188.537ms +0.19% | ±1.10% +119.81% | | RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 85.369mb +0.00% | 18.714ms -2.01% | ±0.59% -56.83% | | RowsBench | bench_drop_1k_on_10k | 2 | 3 | 88.291mb +0.00% | 1.663ms -13.56% | ±0.77% +123.65% | | RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 88.291mb +0.00% | 1.632ms -19.57% | ±1.32% -59.11% | | RowsBench | bench_entries_on_10k | 2 | 3 | 85.403mb +0.00% | 2.473ms -15.43% | ±3.13% +352.03% | | RowsBench | bench_filter_on_10k | 2 | 3 | 85.932mb +0.00% | 15.949ms -4.59% | ±0.36% -28.03% | | RowsBench | bench_find_on_10k | 2 | 3 | 85.932mb +0.00% | 15.984ms -7.00% | ±1.08% +20.24% | | RowsBench | bench_find_one_on_10k | 10 | 3 | 83.836mb +0.00% | 1.800μs -5.56% | ±0.00% -100.00% | | RowsBench | bench_first_on_10k | 10 | 3 | 83.836mb +0.00% | 0.300μs -25.00% | ±0.00% -100.00% | | RowsBench | bench_flat_map_on_1k | 2 | 3 | 93.186mb +0.00% | 12.535ms -8.56% | ±1.69% +280.64% | | RowsBench | bench_map_on_10k | 2 | 3 | 122.557mb +0.00% | 61.649ms -1.76% | ±0.26% -81.79% | | RowsBench | bench_merge_1k_on_10k | 2 | 3 | 86.452mb +0.00% | 1.223ms -22.95% | ±2.07% -8.62% | | RowsBench | bench_partition_by_on_10k | 2 | 3 | 89.799mb +0.00% | 62.461ms -4.68% | ±0.54% -59.92% | | RowsBench | bench_remove_on_10k | 2 | 3 | 88.553mb +0.00% | 3.839ms -4.62% | ±0.84% -52.79% | | RowsBench | bench_sort_asc_on_1k | 2 | 3 | 83.914mb +0.00% | 39.598ms -3.59% | ±1.20% +49.71% | | RowsBench | bench_sort_by_on_1k | 2 | 3 | 83.915mb +0.00% | 40.761ms -0.52% | ±1.42% -30.34% | | RowsBench | bench_sort_desc_on_1k | 2 | 3 | 83.914mb +0.00% | 40.496ms -2.55% | ±0.28% -67.83% | | RowsBench | bench_sort_entries_on_1k | 2 | 3 | 86.278mb +0.00% | 7.362ms -2.85% | ±0.76% -25.88% | | RowsBench | bench_sort_on_1k | 2 | 3 | 83.836mb +0.00% | 28.908ms -5.60% | ±0.69% +43.27% | | RowsBench | bench_take_1k_on_10k | 10 | 3 | 83.836mb +0.00% | 13.100μs -4.93% | ±0.62% -30.87% | | RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 83.836mb +0.00% | 15.888μs -3.67% | ±0.60% +108.00% | | RowsBench | bench_unique_on_1k | 2 | 3 | 102.651mb +0.00% | 191.701ms -0.32% | ±0.82% +41.67% | | NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 116.729mb +0.00% | 519.883ms -6.72% | ±0.80% -71.20% | | NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 60.207mb +0.00% | 255.641ms -3.98% | ±0.81% +141.91% | | NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 15.140mb +0.00% | 54.415ms -5.18% | ±0.41% -70.00% | | TypeDetectorBench | bench_type_detector | 1 | 3 | 59.966mb +0.00% | 433.937ms -1.68% | ±0.40% +84.71% | | TypeDetectorBench | bench_type_detector | 1 | 3 | 14.505mb +0.00% | 85.211ms -2.81% | ±0.55% -1.01% | +-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+ ```