apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Serialization of the org.apache.iceberg.io.WriteResult class. #10710

Open xavifeds8 opened 1 month ago

xavifeds8 commented 1 month ago

Query engine

I am using Apache Flink version 1.16.

Question

I am currently unable to obtain the TypeInformation for org.apache.iceberg.io.WriteResult when using Iceberg's FlinkSink as a streaming sink. For performance reasons I have disabled the Kryo serialization fallback with env.getConfig().disableGenericTypes();

When executing the program I get the following exception:

Exception in thread "main" java.lang.UnsupportedOperationException: Generic types have been disabled in the ExecutionConfig and type org.apache.iceberg.io.WriteResult is treated as a generic type.
    at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:87)
    at org.apache.flink.streaming.api.graph.StreamGraph.createSerializer(StreamGraph.java:1037)
    at org.apache.flink.streaming.api.graph.StreamGraph.addOperator(StreamGraph.java:427)
    at org.apache.flink.streaming.api.graph.StreamGraph.addOperator(StreamGraph.java:399)

simonykq commented 1 month ago

+1

pvary commented 1 month ago

With a well-configured IcebergSink, the number of WriteResults is quite low compared to the number of records, so we did not spend resources on writing a serializer/deserializer for them. Also, WriteResult contains DataFiles, which are themselves complicated objects, so it seemed like a serious effort for relatively low gain.

Did you see some performance issues which could be solved by this?

ms1111 commented 1 month ago

disableGenericTypes() is useful for catching missing type information during development. (Not necessarily about WriteResult, but for any other user type that's used in a job.)

As it stands, you can't use disableGenericTypes() together with the Iceberg sink, so changes in user code that accidentally introduce generic serialization could be missed.
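
To make the fail-fast argument concrete, here is a rough toy model in plain Java. It is only a sketch of the idea and does not use Flink or Iceberg: ToySerializerRegistry, ToyUserEvent, ToyWriteResult, and ToyJob are all hypothetical names invented for illustration, and the string "serializers" stand in for real serializer objects. The point it tries to show is that once the generic fallback is disabled, a single type without an explicit serializer is enough to fail the whole job at planning time.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these classes are Flink or Iceberg APIs.
final class ToySerializerRegistry {
    private final Map<Class<?>, String> explicitSerializers = new HashMap<>();
    private boolean genericTypesEnabled = true;

    // Loosely mimics the effect of disabling the generic (Kryo) fallback.
    void disableGenericTypes() {
        genericTypesEnabled = false;
    }

    // Registers a hand-written serializer (represented by its name) for a type.
    void register(Class<?> type, String serializerName) {
        explicitSerializers.put(type, serializerName);
    }

    // Returns the explicit serializer if one exists, otherwise falls back to
    // the generic one, or fails fast when the fallback has been disabled.
    String serializerFor(Class<?> type) {
        String explicit = explicitSerializers.get(type);
        if (explicit != null) {
            return explicit;
        }
        if (!genericTypesEnabled) {
            throw new UnsupportedOperationException(
                "Generic types have been disabled and type "
                    + type.getName() + " is treated as a generic type.");
        }
        return "generic-fallback";
    }
}

// Hypothetical stand-ins for a user type and a sink-internal type.
final class ToyUserEvent {}
final class ToyWriteResult {}

final class ToyJob {
    // "Plans" a job that needs a serializer for both types.
    static String plan(ToySerializerRegistry registry) {
        return registry.serializerFor(ToyUserEvent.class)
            + "," + registry.serializerFor(ToyWriteResult.class);
    }
}
```

In this model, registering a serializer only for ToyUserEvent lets the job plan while the fallback is enabled, but planning throws as soon as disableGenericTypes() is called, because ToyWriteResult still has no explicit serializer. That is the same shape as the situation described above: the check cannot be switched on for user types alone while a sink-internal type still relies on the generic path.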