flow-php / flow

Flow PHP - data processing framework
https://flow-php.com
MIT License
491 stars 28 forks source link

Fixed writing column statistics when there is only one row that is null #1245

Closed norberttech closed 1 month ago

norberttech commented 1 month ago

Change Log

Added

Fixed

  • Fixed writing column statistics when there is only one row that is null

Changed

Removed

Deprecated

Security


Description

This issue was reported on our Discord channel by user guiguiolol. Code to reproduce this issue:

<?php

use Flow\Parquet\ParquetFile\Schema;
use Flow\Parquet\ParquetFile\Schema\FlatColumn;
use Flow\Parquet\Writer;

require_once __DIR__ . '/../../../vendor/autoload.php';

$writer = new Writer();

$schema = Schema::with(
    FlatColumn::int32('id'),
);

if (file_exists(__DIR__ . '/test.parquet')) {
    unlink(__DIR__ . '/test.parquet');
}

$writer->write(
    __DIR__ . '/test.parquet',
    $schema,
    [
        [
            'id' => null,
        ]
    ]
);

Parquet files created like this can't be read by other software (confirmed through duckdb).

When min/max buffers are empty we can safely make them null (that's what pyarrow is doing). Below python code to reproduce the same scenario:

import pyarrow as pa
import pyarrow.parquet as pq
import os

# Define the directory and file path
dir_path = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(dir_path, 'test.parquet')

# Define the schema with an int32 'id' column that is nullable
schema = pa.schema([
    pa.field('id', pa.int32(), nullable=True)
])

# Prepare the data with 'id' set to None (null)
data = {
    'id': [None]
}

# Create a PyArrow Table with the data and schema
table = pa.Table.from_pydict(data, schema=schema)

# Remove the Parquet file if it already exists
if os.path.exists(file_path):
    os.remove(file_path)

# Write the table to a Parquet file
pq.write_table(table, file_path)
github-actions[bot] commented 1 month ago

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors ```shell +-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+ | benchmark | subject | revs | its | mem_peak | mode | rstdev | +-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+ | CSVExtractorBench | bench_extract_10k | 1 | 3 | 4.602mb +0.00% | 514.006ms -0.10% | ±1.00% +141.27% | | JsonExtractorBench | bench_extract_10k | 1 | 3 | 4.690mb +0.00% | 1.086s +0.38% | ±0.50% -46.68% | | ParquetExtractorBench | bench_extract_10k | 1 | 3 | 29.145mb +0.00% | 431.769ms -3.16% | ±0.71% +5.71% | | TextExtractorBench | bench_extract_10k | 1 | 3 | 4.330mb +0.00% | 33.308ms -1.47% | ±0.11% -74.24% | | XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.312mb +0.00% | 641.925ms -1.88% | ±0.42% -70.74% | +-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+ ```
Transformers ```shell +-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+ | benchmark | subject | revs | its | mem_peak | mode | rstdev | +-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+ | RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 116.606mb +0.00% | 60.244ms -1.16% | ±0.27% -66.12% | +-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+ ```
Loaders ```shell +--------------------+----------------+------+-----+------------------+------------------+-----------------+ | benchmark | subject | revs | its | mem_peak | mode | rstdev | +--------------------+----------------+------+-----+------------------+------------------+-----------------+ | CSVLoaderBench | bench_load_10k | 1 | 3 | 54.799mb +0.00% | 139.343ms -0.81% | ±0.79% +194.33% | | JsonLoaderBench | bench_load_10k | 1 | 3 | 90.382mb +0.00% | 114.931ms -1.17% | ±0.19% -70.25% | | ParquetLoaderBench | bench_load_10k | 1 | 3 | 124.436mb +0.00% | 1.226s -3.35% | ±0.05% -89.45% | | TextLoaderBench | bench_load_10k | 1 | 3 | 17.521mb +0.00% | 43.340ms -4.19% | ±0.66% +26.13% | +--------------------+----------------+------+-----+------------------+------------------+-----------------+ ```
Building Blocks ```shell +-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+ | benchmark | subject | revs | its | mem_peak | mode | rstdev | +-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+ | RowsBench | bench_chunk_10_on_10k | 2 | 3 | 87.355mb +0.00% | 3.373ms -15.35% | ±1.29% +52.64% | | RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 102.958mb +0.00% | 189.810ms -2.10% | ±0.50% -36.04% | | RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 85.678mb +0.00% | 19.022ms -2.04% | ±0.54% -2.95% | | RowsBench | bench_drop_1k_on_10k | 2 | 3 | 88.595mb +0.00% | 1.620ms -17.63% | ±1.12% -41.44% | | RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 88.595mb +0.00% | 1.634ms -17.02% | ±0.40% -44.52% | | RowsBench | bench_entries_on_10k | 2 | 3 | 85.707mb +0.00% | 2.783ms -14.72% | ±0.41% -88.21% | | RowsBench | bench_filter_on_10k | 2 | 3 | 86.236mb +0.00% | 15.051ms -4.42% | ±0.26% -78.27% | | RowsBench | bench_find_on_10k | 2 | 3 | 86.236mb +0.00% | 15.404ms -0.39% | ±0.09% -91.81% | | RowsBench | bench_find_one_on_10k | 10 | 3 | 84.139mb +0.00% | 1.706μs -14.95% | ±2.72% +17.31% | | RowsBench | bench_first_on_10k | 10 | 3 | 84.139mb +0.00% | 0.400μs 0.00% | ±0.00% 0.00% | | RowsBench | bench_flat_map_on_1k | 2 | 3 | 93.490mb +0.00% | 12.635ms -11.37% | ±0.66% -55.35% | | RowsBench | bench_map_on_10k | 2 | 3 | 122.861mb +0.00% | 62.015ms -2.36% | ±1.07% -16.90% | | RowsBench | bench_merge_1k_on_10k | 2 | 3 | 86.755mb +0.00% | 1.394ms -20.73% | ±2.42% -8.61% | | RowsBench | bench_partition_by_on_10k | 2 | 3 | 90.107mb +0.00% | 63.424ms -3.03% | ±1.15% +152.38% | | RowsBench | bench_remove_on_10k | 2 | 3 | 88.857mb +0.00% | 4.141ms -12.77% | ±0.12% -94.53% | | RowsBench | bench_sort_asc_on_1k | 2 | 3 | 84.289mb +0.00% | 39.993ms -6.52% | ±1.32% -26.65% | | RowsBench | bench_sort_by_on_1k | 2 | 3 | 84.289mb +0.00% | 41.013ms -3.62% | ±0.78% -64.44% | | RowsBench | bench_sort_desc_on_1k | 2 | 3 | 84.289mb +0.00% | 40.439ms -1.79% | ±0.72% -63.98% | | RowsBench | bench_sort_entries_on_1k | 2 | 3 | 86.581mb +0.00% | 7.419ms +0.72% | ±0.61% -52.51% | | RowsBench | bench_sort_on_1k | 2 | 3 | 84.139mb +0.00% | 29.002ms -2.79% | ±0.95% -33.90% | | RowsBench | bench_take_1k_on_10k | 10 | 3 | 84.139mb +0.00% | 13.312μs -4.51% | ±0.71% -60.88% | | RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 84.139mb +0.00% | 15.879μs -3.18% | ±0.78% +0.00% | | RowsBench | bench_unique_on_1k | 2 | 3 | 102.959mb +0.00% | 192.652ms -0.98% | ±2.63% +212.27% | | TypeDetectorBench | bench_type_detector | 1 | 3 | 53.181mb +0.00% | 396.449ms -1.27% | ±0.97% +449.97% | | TypeDetectorBench | bench_type_detector | 1 | 3 | 13.447mb +0.00% | 81.240ms +0.22% | ±0.55% -77.45% | | NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 107.443mb +0.00% | 480.736ms -0.43% | ±0.74% -20.09% | | NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 55.801mb +0.00% | 242.706ms +0.25% | ±1.00% -36.83% | | NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 14.639mb +0.00% | 52.619ms -0.82% | ±0.29% -79.54% | +-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+ ```