apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
https://amoro.apache.org/
Apache License 2.0

[Improvement]: Equality field id are different in a RewriteFilesInput #2870

Open XBaith opened 4 months ago

XBaith commented 4 months ago

What would you like to be improved?

tm_id=optimizer-kubed-bts-0-fo26ux-taskmanager-1-2
application_id=/default
java.lang.IllegalArgumentException: Equality delete files have different delete fields
    at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
    at com.netease.arctic.io.reader.CombinedDeleteFilter.<init>(CombinedDeleteFilter.java:130)
    at com.netease.arctic.io.reader.GenericCombinedIcebergDataReader$GenericDeleteFilter.<init>(GenericCombinedIcebergDataReader.java:305)
    at com.netease.arctic.io.reader.GenericCombinedIcebergDataReader.<init>(GenericCombinedIcebergDataReader.java:97)
    at com.netease.arctic.optimizing.IcebergRewriteExecutor.dataReader(IcebergRewriteExecutor.java:68)
    at com.netease.arctic.optimizing.AbstractRewriteFilesExecutor.<init>(AbstractRewriteFilesExecutor.java:84)
    at com.netease.arctic.optimizing.IcebergRewriteExecutor.<init>(IcebergRewriteExecutor.java:46)
    at com.netease.arctic.optimizing.IcebergRewriteExecutorFactory.createExecutor(IcebergRewriteExecutorFactory.java:38)
    at com.netease.arctic.optimizing.IcebergRewriteExecutorFactory.createExecutor(IcebergRewriteExecutorFactory.java:25)
    at com.netease.arctic.optimizer.common.OptimizerExecutor.executeTask(OptimizerExecutor.java:148)
    at com.netease.arctic.optimizer.flink.FlinkOptimizerExecutor.executeTask(FlinkOptimizerExecutor.java:70)
    at com.netease.arctic.optimizer.common.OptimizerExecutor.start(OptimizerExecutor.java:52)
    at com.netease.arctic.optimizer.flink.FlinkExecutor.lambda$open$0(FlinkExecutor.java:59)
    at java.lang.Thread.run(Thread.java:750)

An optimizing process whose equality delete files have different equality field ids cannot be executed.

How should we improve?

No response

klion26 commented 3 months ago

#2912 has enhanced the exception message by adding the paths of the delete files that have different field ids.

XBaith commented 1 month ago

Depending on the actual production scenario, identifier fields may change, resulting in constant failure of the optimisation task. When filtering eq-delete records, can we split them by eq-delete field ids in the optimizing task? In Iceberg, it is possible to add different eq-delete predicates for records with different eq-delete field ids. cc @zhoujinsong @zhongqishang

zhongqishang commented 1 month ago

> Depending on the actual production scenario, identifier fields may change, resulting in constant failure of the optimisation task.

Yes, this is a common scenario. I think we need to support this feature.

> When filtering eq-delete records, can we split them by eq-delete field ids in the optimizing task? In Iceberg, it is possible to add different eq-delete predicates for records with different eq-delete field ids. cc @zhoujinsong @zhongqishang

As you said, I think we can group the eq-delete files by their equality field ids and generate one predicate per group to filter the data.
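
To make the grouping idea concrete, here is a minimal sketch in plain Java. It is not Amoro's or Iceberg's code: `EqDeleteFile`, `row`, `groupDeleteKeys` and `isDeleted` are hypothetical names, rows are modelled as simple maps from field id to value, and sequence-number checks (which decide whether a delete applies to a given data file) are left out. It only illustrates building one delete-key set per distinct set of equality field ids and treating a data row as deleted when any group matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

class EqDeleteGrouping {

  /** Hypothetical stand-in for an equality delete file and its delete rows. */
  static final class EqDeleteFile {
    final Set<Integer> equalityFieldIds;
    final List<Map<Integer, Object>> rows; // each row maps field id -> value

    EqDeleteFile(Set<Integer> equalityFieldIds, List<Map<Integer, Object>> rows) {
      this.equalityFieldIds = equalityFieldIds;
      this.rows = rows;
    }
  }

  /** Helper for building a row from alternating (field id, value) pairs. */
  static Map<Integer, Object> row(Object... idValuePairs) {
    Map<Integer, Object> r = new HashMap<>();
    for (int i = 0; i < idValuePairs.length; i += 2) {
      r.put((Integer) idValuePairs[i], idValuePairs[i + 1]);
    }
    return r;
  }

  /** Projects a row onto the given field ids, in ascending id order. */
  static List<Object> project(Map<Integer, Object> row, Set<Integer> sortedIds) {
    List<Object> key = new ArrayList<>(sortedIds.size());
    for (Integer id : sortedIds) {
      key.add(row.get(id));
    }
    return key;
  }

  /** Groups delete files by their field-id set and collects one delete-key set per group. */
  static Map<Set<Integer>, Set<List<Object>>> groupDeleteKeys(List<EqDeleteFile> files) {
    Map<Set<Integer>, Set<List<Object>>> groups = new LinkedHashMap<>();
    for (EqDeleteFile file : files) {
      // TreeSet gives a stable projection order and value-based equality as a map key.
      Set<Integer> ids = new TreeSet<>(file.equalityFieldIds);
      Set<List<Object>> keys = groups.computeIfAbsent(ids, k -> new HashSet<>());
      for (Map<Integer, Object> deleteRow : file.rows) {
        keys.add(project(deleteRow, ids));
      }
    }
    return groups;
  }

  /** Builds a predicate that is true when any group's key set contains the row's projection. */
  static Predicate<Map<Integer, Object>> isDeleted(List<EqDeleteFile> files) {
    Map<Set<Integer>, Set<List<Object>>> groups = groupDeleteKeys(files);
    return dataRow -> {
      for (Map.Entry<Set<Integer>, Set<List<Object>>> group : groups.entrySet()) {
        if (group.getValue().contains(project(dataRow, group.getKey()))) {
          return true;
        }
      }
      return false;
    };
  }

  public static void main(String[] args) {
    // Two delete files with different equality field ids: {1} and {1, 2}.
    EqDeleteFile byId = new EqDeleteFile(
        new HashSet<>(Arrays.asList(1)), Arrays.asList(row(1, 10)));
    EqDeleteFile byIdAndName = new EqDeleteFile(
        new HashSet<>(Arrays.asList(1, 2)), Arrays.asList(row(1, 20, 2, "b")));

    Predicate<Map<Integer, Object>> deleted = isDeleted(Arrays.asList(byId, byIdAndName));
    System.out.println(deleted.test(row(1, 10, 2, "a"))); // matched by the {1} group
    System.out.println(deleted.test(row(1, 20, 2, "b"))); // matched by the {1, 2} group
    System.out.println(deleted.test(row(1, 20, 2, "c"))); // matched by no group
  }
}
```

With this shape, delete files that share the same field ids still share one key set, so the common single-schema case costs one lookup per row, and each additional field-id set adds one more lookup instead of failing the task.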