apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
https://amoro.apache.org/
Apache License 2.0
849 stars 278 forks source link

[Bug][spark]: "Merge into when not matched" may result in incorrect writing of floating-point field data. #2872

Closed Aireed closed 3 months ago

Aireed commented 4 months ago

What happened?

When executing "merge into when not matched," if the inserted field value is a floating-point constant, the written data is incorrect.

spark-sql> merge into iceberg_test.mix_pk_merge_sink t
         > using (select * from iceberg_test.mix_pk_merge_source) s
         > on s.id=t.id
         > when not matched then insert (t.id,t.data,t.fdata,t.ddata,t.category,t.ts) values (5,'merge-value',1.1, 2.2,'merge-value', s.ts);
Time taken: 2.039 seconds
spark-sql> select * from mix_pk_merge_sink;
5   merge-value 1.5E-44 1.1E-322    2024-05-27 15:59:00.211 merge-value
3   apple-23    21.1    22.2    2024-05-28 14:52:43.842 fruits
1   apple-21    21.1    22.2    2024-05-28 14:52:43.842 fruits
2   apple-122   21.1    22.2    2024-05-28 14:52:43.842 fruits
Time taken: 0.514 seconds, Fetched 4 row(s)
spark-sql>

Affects Versions

master/0.6.x/0.5.x/0.4.x

What engines are you seeing the problem on?

Spark

How to reproduce

use spark-sql

CREATE TABLE iceberg_test.mix_pk_merge_source (
    id bigint not null,
    data string not null,
    fdata float not null,
    ddata double not null,
    ts timestamp not null,
    category string not null,
    primary key (id))
USING MIXED_HIVE
PARTITIONED BY (category);

insert into iceberg_test.mix_pk_merge_source VALUES (1, 'apple-1', 1.1, 2.2, current_timestamp(),'fruits'),(2, 'apple-1', 1.1, 2.2, current_timestamp(),'fruits'),(4, 'apple-1', 1.1, 2.2, current_timestamp(),'fruits');

CREATE TABLE iceberg_test.mix_pk_merge_sink (
      id bigint not null,
    data string not null,
    fdata float not null,
    ddata double not null,
    ts timestamp not null,
    category string not null,
    primary key (id))
USING MIXED_HIVE
PARTITIONED BY (category);

insert into iceberg_test.mix_pk_merge_sink VALUES (1, 'apple-21', 21.1, 22.2, current_timestamp(),'fruits'),(2, 'apple-122', 21.1, 22.2, current_timestamp(),'fruits'),(3, 'apple-23', 21.1, 22.2, current_timestamp(),'fruits');

merge into iceberg_test.mix_pk_merge_sink t
using (select * from iceberg_test.mix_pk_merge_source) s
on s.id=t.id
when not matched then insert (t.id,t.data,t.fdata,t.ddata,t.category,t.ts) values (5,'merge-value',1.1, 2.2,'merge-value', s.ts);

select * from iceberg_test.mix_pk_merge_sink;

Relevant log output

spark-sql> select * from mix_pk_merge_sink;
5   merge-value 1.5E-44 1.1E-322    2024-05-27 15:59:00.211 merge-value
3   apple-23    21.1    22.2    2024-05-28 14:52:43.842 fruits
1   apple-21    21.1    22.2    2024-05-28 14:52:43.842 fruits
2   apple-122   21.1    22.2    2024-05-28 14:52:43.842 fruits
Time taken: 0.514 seconds, Fetched 4 row(s)
spark-sql>

Anything else

No response

Are you willing to submit a PR?

Code of Conduct