apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Flink SQL client cow table query error "org/apache/parquet/column/ColumnDescriptor" (but mor table query normal) #6297

Closed: fujianhua168 closed this issue 1 year ago

fujianhua168 commented 2 years ago

Describe the problem you faced

I created a COW table named 'hudi_cow_tbl' and a MOR table named 'hudi_mor_tbl' in the Flink SQL client, then inserted row data into both. After that, I queried both tables; the MOR table returns results normally, but querying the COW table 'hudi_cow_tbl' raises the error below:

org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:252)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:242)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:233)
    at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:684)
    at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
    at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
    at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:537)
    at akka.actor.Actor.aroundReceive$(Actor.scala:535)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
    at akka.actor.ActorCell.invoke(ActorCell.scala:548)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
    at akka.dispatch.Mailbox.run(Mailbox.scala:231)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.lang.LinkageError: loader constraint violation: loader (instance of sun/misc/Launcher$AppClassLoader) previously initiated loading for a different type with name "org/apache/parquet/column/ColumnDescriptor"
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.flink.formats.parquet.vector.reader.AbstractColumnReader.<init>(AbstractColumnReader.java:108)
    at org.apache.flink.formats.parquet.vector.reader.BytesColumnReader.<init>(BytesColumnReader.java:35)
    at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createColumnReader(ParquetSplitReaderUtil.java:332)
    at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createColumnReader(ParquetSplitReaderUtil.java:297)
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.readNextRowGroup(ParquetColumnarRowSplitReader.java:329)
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.nextBatch(ParquetColumnarRowSplitReader.java:305)
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.ensureBatch(ParquetColumnarRowSplitReader.java:287)
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.reachedEnd(ParquetColumnarRowSplitReader.java:266)
    at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.reachedEnd(CopyOnWriteInputFormat.java:274)
    at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:89)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
    at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)

To Reproduce

I only reproduce the COW table behavior:

 step1: in one session window, execute the following shell commands to start a yarn-session cluster:
          cd /opt/apache/flink  
          bin/yarn-session.sh -m yarn-cluster -nm flink_test -qu default

 step2: open the Flink SQL client in another session window:
         cd /opt/apache/flink
         bin/sql-client.sh embedded -s yarn-session -j ./lib/hudi-flink1.14-bundle_2.11-0.11.1.jar shell

 step3:   reproduce the error:
      CREATE CATALOG myhive WITH (
        'type' = 'hive',
        'default-database' = 'default',
        'hive-conf-dir' = '/opt/apache/hive/conf/',
        'hadoop-conf-dir'='/opt/apache/hadoop/etc/hadoop/'
      );
      USE CATALOG myhive;
      use flink_demo;
      drop table hudi_cow_tbl;
      CREATE TABLE hudi_cow_tbl(
        uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
        name VARCHAR(10),
        age INT,
        ts TIMESTAMP(3)
      )
      WITH (
        'connector' = 'hudi',
        'path' = 'hdfs://bigbigworld/user/hive/warehouse/flink_demo.db/hudi_cow_tbl'
      );
     INSERT INTO hudi_cow_tbl VALUES
    ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01'),
    ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02'),
    ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03'),
    ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04'),
    ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05'),
    ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06'),
    ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07'),
    ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08');
    select * from hudi_cow_tbl; -- after about 5~8s, the error above occurs.

Expected behavior

Why can the MOR table be queried normally while the COW table query fails? What is the reason? (Note: maybe some config params are needed, maybe some jar files in the $FLINK_HOME/lib dir are missing, or maybe some jar packages conflict?) After the query error occurred, I tried to repackage the $root_path/packaging/hudi-flink-bundle module with the mvn command below and then replaced $FLINK_HOME/lib/hudi-flink1.14-bundle_2.11-0.11.1.jar, but it had no effect (remark: tried with parquet.version = 1.11.1 and 1.12.2; neither helped).

    mvn install -DskipTests -Drat.skip=true -Dscala-2.11 -Dflink.format.parquet.version=1.11.1 -Dparquet.version=1.11.1 -Pflink-bundle-shade-hive3
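A LinkageError on org/apache/parquet/column/ColumnDescriptor generally means two copies of that class are reachable through conflicting classloaders. A quick way to look for a second copy on the classpath, as a sketch (it assumes a standard $FLINK_HOME layout; nothing here is from the original report):

    # List every jar under $FLINK_HOME/lib that ships the conflicting class.
    for j in $FLINK_HOME/lib/*.jar; do
      if unzip -l "$j" 2>/dev/null | grep -q 'org/apache/parquet/column/ColumnDescriptor'; then
        echo "$j"
      fi
    done
    # More than one hit means two classloaders can each define the class.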

Environment Description

Stacktrace

yihua commented 2 years ago

@danny0405 could this be due to a dependency conflict?

danny0405 commented 2 years ago

The app uses the wrong classloader; I would suggest using per-job mode instead of yarn-session.
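A sketch of that suggestion, assuming a YARN deployment and Flink 1.14 (the execution.target option is standard Flink, but whether the SQL client picks it up depends on the version and setup):

    # Deploy each job as its own YARN application instead of attaching to the
    # shared yarn-session cluster, so every job gets a fresh classloader.
    echo "execution.target: yarn-per-job" >> $FLINK_HOME/conf/flink-conf.yaml
    # Then start the SQL client without the -s yarn-session flag:
    $FLINK_HOME/bin/sql-client.sh embedded shell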

xushiyan commented 1 year ago

The app uses the wrong classloader; I would suggest using per-job mode instead of yarn-session.

Thanks @danny0405 for the suggestion. @fujianhua168, have you tried it? Will close this in a week if there is no further update here.

danny0405 commented 1 year ago

Closing due to inactivity.

MarlboroBoy commented 1 year ago

@fujianhua168 I have the same problem as you. How did you solve it?

xiaolan-bit commented 9 months ago

How do I solve this problem? Do I need to add or replace any jar?

xiaolan-bit commented 9 months ago

When I use select *, an error appears:

    java.lang.LinkageError: org/apache/parquet/column/ColumnDescriptor
        at org.apache.flink.formats.parquet.vector.reader.AbstractColumnReader.<init>(AbstractColumnReader.java:108)
        at org.apache.flink.formats.parquet.vector.reader.BytesColumnReader.<init>(BytesColumnReader.java:35)
        at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createColumnReader(ParquetSplitReaderUtil.java:364)
        at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createColumnReader(ParquetSplitReaderUtil.java:329)
        at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.readNextRowGroup(ParquetColumnarRowSplitReader.java:334)
        at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.nextBatch(ParquetColumnarRowSplitReader.java:310)
        at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.ensureBatch(ParquetColumnarRowSplitReader.java:292)
        at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.reachedEnd(ParquetColumnarRowSplitReader.java:271)
        at org.apache.hudi.table.format.ParquetSplitRecordIterator.hasNext(ParquetSplitRecordIterator.java:42)
        at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.reachedEnd(CopyOnWriteInputFormat.java:283)
        at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:89)
        at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
        at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
        at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)

xiaolan-bit commented 9 months ago

Is there a problem with the parquet jars? My Flink version is 1.17.1 but the parquet version is 1.13.0.

danny0405 commented 9 months ago

That is because you use the shaded parquet with the hudi prefix. Sometimes, if the referenced class does not correspond to the native class, this error is thrown. A solution is to remove the shading of parquet and use the correct parquet version instead; that is why the hudi flink bundle jar by default does not shade the parquet jar.
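One way to see whether a given bundle shades parquet is to look for relocated class names inside it. A sketch (the relocation prefix shown is how Hudi's shade plugin typically rewrites parquet packages when shading is enabled; adjust the jar name to your build):

    # Unshaded parquet classes keep their original package name:
    jar tf hudi-flink1.17-bundle-0.14.0.jar | grep '^org/apache/parquet/column/ColumnDescriptor'
    # A shaded (relocated) copy would carry the hudi prefix instead:
    jar tf hudi-flink1.17-bundle-0.14.0.jar | grep 'org/apache/hudi/org/apache/parquet/column/ColumnDescriptor'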

xiaolan-bit commented 9 months ago

It did not work. These are the jars in flink1.17.1/lib:

    -rw-rw-rw- 1 hadoop hadoop   7304133 Nov 23 10:40 calcite-core-1.29.0.jar
    -rw-r--r-- 1 hadoop hadoop    196491 May 19  2023 flink-cep-1.17.1.jar
    -rw-r--r-- 1 hadoop hadoop    542620 May 19  2023 flink-connector-files-1.17.1.jar
    -rw-r--r-- 1 hadoop hadoop    102472 May 19  2023 flink-csv-1.17.1.jar
    -rw-r--r-- 1 hadoop hadoop 135975541 May 19  2023 flink-dist-1.17.1.jar
    -rw-r--r-- 1 hadoop hadoop    180248 May 19  2023 flink-json-1.17.1.jar
    -rw-r--r-- 1 hadoop hadoop  21043319 May 19  2023 flink-scala_2.12-1.17.1.jar
    -rw-rw-rw- 1 hadoop hadoop  39256670 Nov 15 11:21 flink-shaded-hadoop-3-3.1.1.7.2.8.0-224-9.0.jar
    -rw-rw-rw- 1 hadoop hadoop  51337510 Nov 22 16:35 flink-sql-connector-hive-3.1.3_2.12-1.17.1.jar
    -rw-r--r-- 1 hadoop hadoop  15407424 May 19  2023 flink-table-api-java-uber-1.17.1.jar
    -rw-r--r-- 1 hadoop hadoop  38191226 May 19  2023 flink-table-planner-loader-1.17.1.jar
    -rw-r--r-- 1 hadoop hadoop   3146210 May 19  2023 flink-table-runtime-1.17.1.jar
    -rw-rw-rw- 1 hadoop hadoop     31532 Nov 15 15:25 htrace-core-2.04.jar
    -rw-rw-rw- 1 hadoop hadoop   1475955 Nov 15 15:26 htrace-core-3.1.0-incubating.jar
    -rw-rw-rw- 1 hadoop hadoop   1485102 Nov 15 15:26 htrace-core4-4.0.1-incubating.jar
    -rw-rw-rw- 1 hadoop hadoop  94889852 Nov 23 16:39 hudi-flink1.17-bundle-0.14.0.jar
    -rw-r--r-- 1 root   root    36360583 Nov 22 10:51 hudi-hadoop-mr-bundle-0.12.3.jar
    -rw-r--r-- 1 root   root    36550073 Nov 22 10:51 hudi-hive-sync-bundle-0.12.3.jar
    -rw-r--r-- 1 hadoop hadoop  30980105 Aug 28 16:07 iceberg-flink-runtime-1.17-1.3.1.jar
    -rw-r--r-- 1 hadoop hadoop    208006 May 17  2023 log4j-1.2-api-2.17.1.jar
    -rw-r--r-- 1 hadoop hadoop    301872 May 17  2023 log4j-api-2.17.1.jar
    -rw-r--r-- 1 hadoop hadoop   1790452 May 17  2023 log4j-core-2.17.1.jar
    -rw-r--r-- 1 hadoop hadoop     24279 May 17  2023 log4j-slf4j-impl-2.17.1.jar
    -rw-rw-rw- 1 hadoop hadoop    909584 Nov 23 15:13 parquet-avro-1.13.0.jar
    -rw-rw-rw- 1 hadoop hadoop   2027225 Nov 23 12:46 parquet-column-1.13.0.jar
    -rw-rw-rw- 1 hadoop hadoop     97186 Nov 23 12:46 parquet-common-1.13.0.jar
    -rw-rw-rw- 1 hadoop hadoop    849289 Nov 23 12:46 parquet-encoding-1.13.0.jar
    -rw-rw-rw- 1 hadoop hadoop    726179 Nov 23 12:46 parquet-format-structures-1.13.0.jar
    -rw-rw-rw- 1 hadoop hadoop   1004565 Nov 23 12:46 parquet-hadoop-1.13.0.jar
    -rw-rw-rw- 1 hadoop hadoop   2017387 Nov 23 12:46 parquet-jackson-1.13.0.jar

And the commands are below:

    ./bin/yarn-session.sh -d
    ./bin/sql-client.sh embedded -j lib/hudi-flink1.17-bundle-0.14.0.jar shell

    set execution.checkpointing.interval=3sec;
    set sql-client.execution.result-mode = tableau;
    create table flink_hudi_hive (
      uuid STRING PRIMARY KEY NOT ENFORCED,
      name STRING,
      age INT,
      ts STRING,
      `partition` STRING
    ) PARTITIONED BY (`partition`) WITH (
      'connector' = 'hudi',
      'path' = 'hdfs://hdfs-ha/hudi/flink/flink_hudi_hive',
      'table.type' = 'COPY_ON_WRITE',
      'hoodie.datasource.write.recordkey.field' = 'uuid',
      'write.precombine.field' = 'ts',
      'write.tasks' = '1',
      'write.rate.limit' = '2000',
      'compaction.tasks' = '1',
      'compaction.async.enabled' = 'true',
      'compaction.trigger.strategy' = 'num_commits',
      'compaction.delta_commits' = '1',
      'changelog.enabled' = 'true',
      'read.streaming.check-interval' = '3',
      'hive_sync.enable' = 'true',
      'hive_sync.mode' = 'hms',
      'hive_sync.metastore.uris' = 'thrift://12345-az1-master-1-1:9083',
      'hive_sync.jdbc_url' = 'jdbc:hive2://12345-az1-master-1-1:10001/default;transportMode=http;httpPath=cliservice',
      'hive_sync.table' = 'flink_hudi_hive',
      'hive_sync.db' = 'default',
      'hive_sync.support_timestamp' = 'true'
    );
    INSERT INTO flink_hudi_hive VALUES
      ('id1', 'Tom', 25, '1970-01-01 00:00:03', 'par1'),
      ('id2', 'Jerry', 30, '1970-01-01 00:00:04', 'par1'),
      ('id3', 'Tom', 25, '1970-01-01 00:00:03', 'par2'),
      ('id4', 'Jerry', 30, '1970-01-01 00:00:04', 'par2'),
      ('id5', 'Spike', 35, '1970-01-01 00:00:05', 'par3'),
      ('id6', 'Tyke', 40, '1970-01-01 00:00:06', 'par4'),
      ('id7', 'Butch', 45, '1970-01-01 00:00:07', 'par4');

When I enter SELECT * FROM flink_hudi_hive;, the error appears:

    2023-11-23 16:45:48
    java.lang.LinkageError: org/apache/parquet/column/ColumnDescriptor
        (same stack trace as in my previous comment)

xiaolan-bit commented 9 months ago

I don't know how to fix this; can you give me some suggestions?

danny0405 commented 9 months ago

What parquet version did you use when packaging the flink hudi bundle jar?

xiaolan-bit commented 9 months ago

1.13.0

xiaolan-bit commented 9 months ago

I edited flink-conf.yaml and set classloader.resolve-order: parent-first, and now I can select in Flink SQL, but not in Hive SQL. Some errors appear in Hive SQL:

    hive (default)> select * from flink_hudi_hive;
    WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
    WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/calcite/rel/rules/AbstractMaterializedViewRule$MaterializedViewProjectFilterRule
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1408)
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.getOptimizedAST(CalcitePlanner.java:1430)
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:450)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12164)
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:330)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:285)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:659)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1826)
        at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1773)
        at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1768)
        at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126)
        at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:214)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:328)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:241)
    Caused by: java.lang.ClassNotFoundException: org.apache.calcite.rel.rules.AbstractMaterializedViewRule$MaterializedViewProjectFilterRule
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 24 more
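If flipping the whole cluster to parent-first breaks other tools, Flink also allows keeping the default child-first loading while sending selected packages to the parent classloader. A sketch using the standard classloader.parent-first-patterns.additional option (treating only parquet specially is an assumption, not something suggested in this thread):

    # Resolve parquet classes only from the parent classloader; everything
    # else keeps the default child-first behavior.
    echo "classloader.parent-first-patterns.additional: org.apache.parquet." \
        >> $FLINK_HOME/conf/flink-conf.yaml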

danny0405 commented 9 months ago

1.13.0

Both Flink (https://github.com/apache/flink/blob/cef22667ae65fb6fda78f22a05a8ca91540af687/flink-formats/pom.xml#L33) and Hudi Flink 1.17 (https://github.com/apache/hudi/blob/bcb974b48133273ec0ef3e33b91889c61609e8b2/pom.xml#L2607) use parquet 1.12.3; can you use parquet 1.12.3 to see whether the issue goes away?
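One way to confirm which parquet version a bundle build actually resolves, as a sketch (standard Maven; $HUDI_HOME is assumed to be the checkout root):

    # Print the parquet artifacts resolved for the bundle module.
    cd $HUDI_HOME/packaging/hudi-flink-bundle
    mvn dependency:tree -Pflink1.17 -Dincludes=org.apache.parquet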

The Calcite error indicates a Calcite version mismatch; did you check whether your bundle jar contains Calcite classes?
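To check for stray Calcite classes, a sketch along the same lines as the parquet scan above (paths are illustrative):

    # Find jars on the Flink and Hive classpaths that bundle Calcite classes.
    for j in $FLINK_HOME/lib/*.jar $HIVE_HOME/lib/*.jar; do
      if unzip -l "$j" 2>/dev/null | grep -q 'org/apache/calcite/rel/rules/'; then
        echo "$j"
      fi
    done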

danny0405 commented 9 months ago

It looks like we have an inconsistency for parquet-avro and parquet-hadoop: the parquet.version property in the hudi root pom is 1.10.1, but for flink 1.17 we try to override it to 1.12.3. Can you try to repackage your bundle jar, always specifying the flink1.17 profile, so that we can make sure the parquet version is consistent?
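A sketch of such a rebuild, always passing the flink1.17 profile so the parquet override applies at every step (flags mirror the commands already used in this thread; paths are assumptions):

    cd $HUDI_HOME
    mvn clean install -DskipTests -Drat.skip=true -Dflink1.17 -Dscala-2.12 \
        -Pflink-bundle-shade-hive3
    # The bundle is produced under packaging/hudi-flink-bundle/target/.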

xiaolan-bit commented 9 months ago

1.13.0

Both Flink (https://github.com/apache/flink/blob/cef22667ae65fb6fda78f22a05a8ca91540af687/flink-formats/pom.xml#L33) and Hudi Flink 1.17 (https://github.com/apache/hudi/blob/bcb974b48133273ec0ef3e33b91889c61609e8b2/pom.xml#L2607) use parquet 1.12.3; can you use parquet 1.12.3 to see whether the issue goes away? The Calcite error indicates a Calcite version mismatch; did you check whether your bundle jar contains Calcite classes?

Hi, I changed my parquet version to 1.12.3 in flink-1.17.1/lib and hive-3.1.3/lib:

    -rw-rw-rw- 1 hadoop hadoop  909696 Nov 24 10:21 parquet-avro-1.12.3.jar
    -rw-rw-rw- 1 hadoop hadoop 2026078 Nov 24 10:21 parquet-column-1.12.3.jar
    -rw-rw-rw- 1 hadoop hadoop   96489 Nov 24 10:21 parquet-common-1.12.3.jar
    -rw-rw-rw- 1 hadoop hadoop  848483 Nov 24 10:21 parquet-encoding-1.12.3.jar
    -rw-rw-rw- 1 hadoop hadoop  726071 Nov 24 10:21 parquet-format-structures-1.12.3.jar
    -rw-rw-rw- 1 hadoop hadoop  990925 Nov 24 10:21 parquet-hadoop-1.12.3.jar
    -rw-rw-rw- 1 hadoop hadoop 2015374 Nov 24 10:21 parquet-jackson-1.12.3.jar

I also added a '#' before 'classloader.resolve-order: parent-first' in flink-conf.yaml. When I reran the case above, the error 'java.lang.LinkageError: org/apache/parquet/column/ColumnDescriptor' appeared again. When I removed the '#' before 'classloader.resolve-order: parent-first' in flink-conf.yaml, I could select * from my table in Flink SQL, but not in Hive SQL.

xiaolan-bit commented 9 months ago

It looks like we have an inconsistency for parquet-avro and parquet-hadoop: the parquet.version property in the hudi root pom is 1.10.1, but for flink 1.17 we try to override it to 1.12.3. Can you try to repackage your bundle jar, always specifying the flink1.17 profile, so that we can make sure the parquet version is consistent?

I git cloned the hudi release-0.14.0 code, opened the packaging/hudi-flink-bundle project, and used

    mvn clean install -DskipTests -Drat.skip=true -Dflink1.17 -Dscala-2.12 -Pflink-bundle-shade-hive3

to package a jar named 'hudi-flink1.17-bundle-0.14.0.jar' (size 90.4 MB). I uploaded this jar to Flink's lib and redid the case; the error 'java.lang.LinkageError: org/apache/parquet/column/ColumnDescriptor' appeared again.

danny0405 commented 9 months ago

You have multiple parquet jars on the classpath; were the parquet-related jars added by yourself? The hudi flink bundle jar already includes the parquet jars.
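Given that the bundle already ships parquet, a sketch of de-duplicating the classpath (moving rather than deleting the standalone jars, so the change is reversible):

    # Keep only the bundle's parquet classes on Flink's classpath.
    mkdir -p $FLINK_HOME/lib-removed
    mv $FLINK_HOME/lib/parquet-*.jar $FLINK_HOME/lib-removed/
    # Restart the yarn-session cluster afterwards so the change takes effect.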

xiaolan-bit commented 9 months ago

You have multiple parquet jars on the classpath; were the parquet-related jars added by yourself? The hudi flink bundle jar already includes the parquet jars.

Yes, I downloaded these jars from the web and added them to lib. I removed the parquet-*.jar files from flink/lib, changed the pom.xml's parquet.version from 1.12.2 to 1.12.3, repackaged hudi-flink1.17-bundle-0.14.0.jar, and used it. The error below appeared again:

    2023-11-24 14:06:38
    java.lang.LinkageError: org/apache/parquet/column/ColumnDescriptor
        (same stack trace as above)

xiaolan-bit commented 9 months ago

I tried another way: I used the hudi-flink1.17-bundle-0.14.0.jar directly and added the parquet-*.jar files to flink-1.17.1/lib. Flink SQL can run the insert job, but on select it also throws 'java.lang.LinkageError: org/apache/parquet/column/ColumnDescriptor'.

danny0405 commented 9 months ago

I tried to reproduce it in my local env, and it works on my side. Here is how I package the hudi flink bundle jar:

  1. cd into $HUDI_HOME, execute mvn clean install -DskipTests;
  2. cd into $HUDI_HOME/hudi-client/hudi-flink-client, execute mvn clean install -DskipTests -Pflink1.17;
  3. cd into $HUDI_HOME/hudi-flink-datasource, execute mvn clean install -DskipTests -Pflink1.17;
  4. cd into $HUDI_HOME/packaging/hudi-flink-bundle, execute mvn clean install -DskipTests -Pflink1.17

Then copy the jar hudi-flink1.17-bundle-xxx.jar into $FLINK_HOME/lib, and the COW table works perfectly (the full sequence is collected in the script below).
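For reference, the four steps above as one script (a sketch; $HUDI_HOME, $FLINK_HOME, and the wildcarded jar name are assumptions):

    cd $HUDI_HOME && mvn clean install -DskipTests
    cd $HUDI_HOME/hudi-client/hudi-flink-client && mvn clean install -DskipTests -Pflink1.17
    cd $HUDI_HOME/hudi-flink-datasource && mvn clean install -DskipTests -Pflink1.17
    cd $HUDI_HOME/packaging/hudi-flink-bundle && mvn clean install -DskipTests -Pflink1.17
    # Copy the resulting bundle into Flink's lib directory.
    cp $HUDI_HOME/packaging/hudi-flink-bundle/target/hudi-flink1.17-bundle-*.jar $FLINK_HOME/lib/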


Jimnywen commented 6 months ago

I solved a similar problem by not passing the -j lib/hudi-flink1.17-bundle-xxx.jar argument; just run the SQL client simply, as sql-client.sh embedded shell.

ChrisKyrie commented 3 months ago

I solved a similar problem by not passing the -j lib/hudi-flink1.17-bundle-xxx.jar argument; just run the SQL client simply, as sql-client.sh embedded shell.

I solved this problem in the same way: I use -j hudi-flink1.17-bundle-xxx.jar in the sql-client, and I did not add hudi-flink1.17-bundle-xxx.jar to ${FLINK_HOME}/lib.
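Taken together, the last two comments point at the same rule: the bundle should reach the SQL client's classpath exactly once, either from $FLINK_HOME/lib or via -j, never both. A sketch of the two mutually exclusive options (paths are illustrative):

    # Option A: the bundle sits in $FLINK_HOME/lib, so no -j flag is passed.
    $FLINK_HOME/bin/sql-client.sh embedded shell

    # Option B: the bundle is passed with -j only, with no copy in $FLINK_HOME/lib.
    $FLINK_HOME/bin/sql-client.sh embedded -j /path/to/hudi-flink1.17-bundle-0.14.0.jar shell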