apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Bug] Hive DDL and paimon schema mismatched #4556

Open GangYang-HX opened 1 day ago

GangYang-HX commented 1 day ago

Search before asking

Paimon version

Paimon-0.8.1

Compute Engine

Flink-1.18.1

Minimal reproduce step

  1. Start a Spark offline job with a large number of tasks that read the Paimon table data.
  2. While the offline job is running, add a new column to the table. The failure does not always appear, but it reproduces with high probability (see the sketch after this list).
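
A minimal sketch of the scenario. The database, table, and column names, the output path, and the session setup are assumptions for illustration, not taken from the report; in the report the new column is added from another engine while Spark reads the table through the Hive catalog.

```java
import org.apache.spark.sql.SparkSession;

public class ReproduceSchemaMismatch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("paimon-hive-schema-mismatch")
                .enableHiveSupport()   // read the Paimon table as a Hive external table
                .getOrCreate();

        // Step 1: a long-running offline read with many tasks; each split builds its
        // job conf via PaimonStorageHandler#configureInputJobProperties (see the
        // stack trace below).
        Thread reader = new Thread(() ->
                spark.sql("SELECT * FROM ods.wide_table")
                     .write().mode("overwrite")
                     .parquet("/tmp/paimon_read_out"));
        reader.start();

        // Step 2: while the read is still running, add a new column from another
        // session, e.g. through the Paimon catalog in Flink SQL:
        //     ALTER TABLE ods.wide_table ADD new_col STRING;

        reader.join();
        spark.stop();
    }
}
```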

What doesn't meet your expectations?

The alterTable operation is not atomic. When the Paimon table data is read, the Hive DDL fields are checked against the latest Paimon schema, so while an alter is in progress there is a window in which the two do not match, and a query that hits that window fails with the exception below:

```
Hive DDL and paimon schema mismatched! It is recommended not to write any column definition as Paimon external table can read schema from the specified location.
There are 1665 fields in Hive DDL: id, sticky_album_id ......
There are 1666 fields in Paimon schema: id, sticky_album_id ......
    at org.apache.paimon.hive.HiveSchema.checkFieldsMatched(HiveSchema.java:249)
    at org.apache.paimon.hive.HiveSchema.extract(HiveSchema.java:165)
    at org.apache.paimon.hive.PaimonStorageHandler.getDataFieldsJsonStr(PaimonStorageHandler.java:89)
    at org.apache.paimon.hive.PaimonStorageHandler.configureInputJobProperties(PaimonStorageHandler.java:84)
    at org.apache.spark.sql.hive.HiveTableUtil$.configureJobPropertiesForStorageHandler(TableReader.scala:438)
    at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:468)
    at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:354)
    at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:354)
    at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:184)
    at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:184)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$6(HadoopRDD.scala:184)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:181)
```

Anything else?

Screenshot: `org.apache.paimon.hive.HiveCatalog#alterTableImpl`
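
Since the report states that alterTable is not atomic, the point is presumably that the Paimon schema commit and the Hive metastore update happen as two separate steps. A toy model of that ordering, with plain lists standing in for the two stores (this is not Paimon source code):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy stand-ins for the two places the column list lives.
class SchemaStores {
    static final List<String> paimonSchema = new CopyOnWriteArrayList<>();
    static final List<String> hiveDdl = new CopyOnWriteArrayList<>();

    // Models the claimed behavior of HiveCatalog#alterTableImpl: the Paimon schema
    // is committed first and the Hive metastore is updated afterwards, so the two
    // can briefly disagree.
    static void alterTableAddColumn(String column) throws InterruptedException {
        paimonSchema.add(column);   // step 1: new schema committed to the table location
        Thread.sleep(50);           // window in which readers see 1666 vs 1665 fields
        hiveDdl.add(column);        // step 2: Hive DDL catches up
    }
}
```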

Screenshot: `org.apache.paimon.hive.HiveSchema#checkFieldsMatched`
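
And the read side: a sketch of the check that produces the message above, continuing the toy model. The real `HiveSchema#checkFieldsMatched` also compares field names and types; this simplified version only compares counts and is not the actual source.

```java
import java.util.List;

class HiveSchemaCheck {
    // If a split initializes its job conf inside the window left by
    // alterTableAddColumn, the sizes differ and the query fails with
    // "Hive DDL and paimon schema mismatched!".
    static void checkFieldsMatched(List<String> hiveFields, List<String> paimonFields) {
        if (hiveFields.size() != paimonFields.size()) {
            throw new IllegalArgumentException(
                "Hive DDL and paimon schema mismatched! There are "
                    + hiveFields.size() + " fields in Hive DDL: " + String.join(", ", hiveFields)
                    + " There are " + paimonFields.size() + " fields in Paimon schema: "
                    + String.join(", ", paimonFields));
        }
    }
}
```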

Are you willing to submit a PR?