apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[core] Fix partition column generate wrong partition spec #4349

Closed (ulysses-you closed 1 month ago)

ulysses-you commented 1 month ago

Purpose

Paimon uses .toString to generate partition values, which is not accurate for some data types, such as date and binary. The Spark engine, for example, uses a Cast to convert a partition object to a string value. This PR therefore changes Paimon to use a cast to generate partition values.
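The difference between the two approaches can be illustrated with a minimal Java sketch. It is not Paimon's implementation: the class and method names are hypothetical, it assumes a date partition is held internally as an int counting days since the epoch, and it assumes the cast decodes binary values as UTF-8 (which is how Spark casts binary to string).

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;

// Hypothetical sketch, not Paimon code: compares toString-based and
// cast-based rendering of partition values.
class PartitionValueSketch {

    // Legacy style: rely on Object.toString() of the internal value.
    static String legacyValue(Object internalValue) {
        return internalValue.toString();
    }

    // Cast style for binary: decode the bytes as UTF-8 text.
    static String castBinary(byte[] value) {
        return new String(value, StandardCharsets.UTF_8);
    }

    // Cast style for date, assuming the internal value is days since epoch.
    static String castDate(int epochDays) {
        return LocalDate.ofEpochDay(epochDays).toString();
    }

    public static void main(String[] args) {
        byte[] day = "2021".getBytes(StandardCharsets.UTF_8);
        // Prints the array type and an identity hash such as [B@4a045a11;
        // the hash part differs between objects and runs.
        System.out.println(legacyValue(day));
        // Prints the decoded content: 2021
        System.out.println(castBinary(day));

        int epochDays = (int) LocalDate.of(2021, 1, 1).toEpochDay();
        // Prints the raw day count rather than a calendar date.
        System.out.println(legacyValue(epochDays));
        // Prints 2021-01-01
        System.out.println(castDate(epochDays));
    }
}
```

For byte arrays the toString result depends on object identity rather than content, so it cannot serve as a stable partition name.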

Add a new config, partition.legacy-name, that switches back to the previous toString behavior. By default, the legacy behavior (.toString) is used.
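A rough sketch of how such a switch might behave is shown below. Only the option key partition.legacy-name and its legacy default come from this PR description; the class, the method, and the plain Map used for options are hypothetical stand-ins, and only the binary case is covered.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Paimon code: selects the partition value
// rendering based on the partition.legacy-name option.
class LegacyNameSwitchSketch {

    static final String LEGACY_NAME_KEY = "partition.legacy-name";

    static String binaryPartitionValue(byte[] value, Map<String, String> options) {
        // The description states that the legacy behavior is the default.
        boolean legacy = Boolean.parseBoolean(options.getOrDefault(LEGACY_NAME_KEY, "true"));
        return legacy
                ? value.toString()                              // old behavior
                : new String(value, StandardCharsets.UTF_8);    // cast-style behavior
    }

    public static void main(String[] args) {
        byte[] day = "2021".getBytes(StandardCharsets.UTF_8);

        Map<String, String> defaults = new HashMap<>();
        // Legacy rendering: type tag plus identity hash.
        System.out.println(binaryPartitionValue(day, defaults));

        Map<String, String> castStyle = new HashMap<>();
        castStyle.put(LEGACY_NAME_KEY, "false");
        // Cast-style rendering: 2021
        System.out.println(binaryPartitionValue(day, castStyle));
    }
}
```

Keeping the legacy path as the default means that tables written before the change continue to resolve to the same partition directories unless the option is explicitly set to false.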

An example where using a binary-type partition column causes a failure:

CREATE TABLE pt (
    id BIGINT,
    c1 STRING
) using paimon
PARTITIONED BY (day binary);

insert into table pt values(1, 'a', cast('2021' as binary));
select * from pt;
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (192.168.0.102 executor driver): java.io.FileNotFoundException: File 'warehouse/default.db/pt/day=%5BB@4a045a11/bucket-0/data-91c064a3-a0a1-4042-9d5a-cc82a23af7ff-0.parquet' not found, Possible causes: 1.snapshot expires too fast, you can configure 'snapshot.time-retained' option with a larger value. 2.consumption is too slow, you can improve the performance of consumption (For example, increasing parallelism).
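The partition directory in the error above, day=%5BB@4a045a11, appears to be the percent-encoded form of a byte array's default toString output. The following sketch (hypothetical class name, standard JDK calls only) decodes that path segment to make this visible.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch: decodes the partition path segment from the error.
class PartitionPathDecodeSketch {

    static String decodeSegment(String segment) {
        try {
            // %5B is the percent-encoding of '['.
            return URLDecoder.decode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints day=[B@4a045a11
        System.out.println(decodeSegment("day=%5BB@4a045a11"));
        // Any byte array's default toString starts with the same [B@ prefix.
        System.out.println(new byte[0].toString().startsWith("[B@"));
    }
}
```

The decoded value has the shape of Object.toString() for a byte[], which is consistent with the partition name having been derived from the array's identity rather than from its content ('2021').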

Tests

add test

API and Format

no

Documentation

added docs

ulysses-you commented 1 month ago

But this is really dangerous for compatibility. It may be better to keep the old style by default.

@JingsongLi +1