@lyogev can you please paste the spark-shell or spark-submit command you used that would reproduce this?
Spark 3.0 is a new release (it only came out on 2020-06-18) and is quite different from 2.x, so it is not surprising that there are some conflicts. At present, using Hudi with Spark 3.0 is not recommended.
Let's wait for @lyogev to chime in.. I think @n3nash did explicitly test Spark 3 and confirmed it worked as of 0.5.1/0.5.2.
Sorry for not responding sooner. I am working on this PR: https://github.com/YotpoLtd/metorikku/pull/335, which is currently failing in CI. This PR in Spark (merged for version 3.0.0): https://github.com/apache/spark/pull/28223 moved fromRow out of ExpressionEncoder into another class, and Hudi uses it in https://github.com/apache/hudi/blob/89e37d5273ea1c6bf2fe3a8f7053e7a3cc44011d/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionUtils.scala#L42
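For anyone hitting the same compile error, here is a minimal sketch of the API change itself (generic Spark 3.0 code to illustrate the replacement, not Hudi's actual fix):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Spark 2.x: the encoder converted rows directly (removed by apache/spark#28223):
//   encoder.fromRow(internalRow)
// Spark 3.0: the conversion moved into a dedicated Deserializer:
def internalRowToRow(schema: StructType, internal: InternalRow): Row = {
  val encoder = RowEncoder(schema).resolveAndBind()
  encoder.createDeserializer().apply(internal)
}
```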
Does #1760 help? If you could try that out and report back, it'd be awesome.
Looks like it does! Now I'm stuck at Hive sync:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/calcite/rel/type/RelDataTypeSystem
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory.get(SemanticAnalyzerFactory.java:318)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:484)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:401)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:384)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:374)
at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:266)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:152)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:120)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:93)
at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:286)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:189)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:944)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380)
I think Calcite was removed from Spark as a dependency.
@lyogev could you try including it explicitly in the hudi-spark-bundle pom under packaging and give it a shot? (If not, I will make some time to try this later tonight or tomorrow morning and push a PR.)
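Something along these lines in the bundle pom; the version (the 1.16.0 cited later in this thread) and the relocation pattern are guesses based on how the other bundles shade things, not the final fix:

```xml
<!-- Guess at the shape of the change; version and relocation are assumptions. -->
<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
  <version>1.16.0</version>
</dependency>

<!-- and under the maven-shade-plugin includes/relocations: -->
<include>org.apache.calcite:calcite-core</include>
<relocation>
  <pattern>org.apache.calcite</pattern>
  <shadedPattern>org.apache.hudi.org.apache.calcite</shadedPattern>
</relocation>
```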
I did not run into this when testing #1760 myself; I think it might be because we have internal changes for Hive 3.
I just checked, and it looks like we have calcite added and shaded in hudi-hive-bundle.
I think you might also need to add and shade libfb303 for Hive sync.
We had these dependency changes for Hive 3 compatibility; however, we didn't realize this was also needed for Spark 3. I will try to update #1760 with what is needed.
Thanks @bschell!
Was wondering if there is an update here! Running a PoC and would love to use Hudi + Spark 3 if possible. Thanks!
@nsivabalan: Can you reply when you get a chance?
Thanks, Balaji.V
@bschell is driving this. Ref PR: https://github.com/apache/hudi/pull/1760. @bschell: any rough timelines?
Spark 3 support is available in the latest release. Closing this for now. Let us know/reopen if you have any further asks.
I am getting the same java.lang.NoClassDefFoundError: org/apache/calcite/rel/type/RelDataTypeSystem
when trying to use Hive sync. It looks like Spark 3 is using a custom class loading mechanism for loading Hive:
https://github.com/apache/spark/blob/ab93729987084ec55f762639a7e7f7cb8dd275e1/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L170
I think this might cause some classes to be loaded only if they are coming from one of the exec jars, which I assume are the jars shipped with Spark.
I resolved the issue by deleting these two exclusions from Spark: https://github.com/apache/spark/blob/v3.0.1/pom.xml#L1692-L1699. After that, calcite-core becomes part of the distribution, but then I get the following error:
SemanticException Cannot find class 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
org.apache.hadoop.hive.ql.parse.SemanticException: Cannot find class 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
at org.apache.hadoop.hive.ql.parse.ParseUtils.ensureClassExists(ParseUtils.java:263)
@dszakallas got the same behavior on Spark 3.0.2 / Hudi 0.8 (RelDataTypeSystem, then HoodieParquetInputFormat).
Fixed it by turning on .option("hoodie.datasource.hive_sync.use_jdbc", "true");
there is then no need for calcite jars or anything.
I'm getting the same java.lang.NoClassDefFoundError: org/apache/calcite/rel/type/RelDataTypeSystem
error with Spark 3.1.1 and hudi-spark3-bundle_2.12-0.8.0.jar. I am unable to use .option("hoodie.datasource.hive_sync.use_jdbc", "true")
I see mention above of the missing calcite-core dependency, but I don't see any reference to that package in #1760 where Spark 3 support was added. Do I need to manually deploy the calcite-core dependency myself? If so, how do I know which version to pull from https://repo1.maven.org/maven2/org/apache/calcite/calcite-core? I see 1.16.0 referenced above, but I'm not clear how to confirm that it supports my versions of Hudi, Spark, etc. @bschell @nsivabalan
@dude0001
.option("hoodie.datasource.hive_sync.use_jdbc", "false")
.option("hoodie.datasource.hive_sync.mode": "hms"),
did the trick for me
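For context, here is where those options sit in a full write call; a sketch, where everything besides the two hive_sync options above is a placeholder:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: sync via the Hive metastore (hms) instead of JDBC/HiveDriver.
def writeWithHmsSync(df: DataFrame): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "my_table")                 // placeholder
    .option("hoodie.datasource.write.recordkey.field", "id") // placeholder
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.use_jdbc", "false")
    .option("hoodie.datasource.hive_sync.mode", "hms")
    .mode(SaveMode.Append)
    .save("s3://my-bucket/tables/my_table")                  // placeholder
}
```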
For anyone who gets here because of the AWS Glue 3.0 + Hudi Connector issue:
Status as of 2021-11-13: AWS Glue 3.0 fails to load the image from ECR for the Hudi Connector dependencies. If you don't need Hudi schema evolution, go with AWS Glue 2.0. If you do need Hudi schema evolution, you have to use AWS Glue 3.0; otherwise you will hit the Glue Catalog issue caused by an outdated EMRFS.
You can still use AWS Glue 3.0 + Hudi by adding the Hudi JAR dependencies yourself instead of having the Glue Connector do it for you.
You need four dependencies.
I hope this helps someone. I was stuck on this for a day.
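A rough sketch of the manual route (the bucket, jar names, and versions below are placeholders, not my exact list): upload the jars to S3 and pass them to the Glue job via the --extra-jars special parameter, e.g.

```
--extra-jars s3://<your-bucket>/jars/hudi-spark3-bundle_2.12-<version>.jar,s3://<your-bucket>/jars/calcite-core-<version>.jar,s3://<your-bucket>/jars/libfb303-<version>.jar,s3://<your-bucket>/jars/<fourth-jar>.jar
```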
LOL. I never thought I would get help from my own comment that I left here.
OMG. I am here again. I am building an ETL in AWS EMR on EKS + Hudi.
The error I have:
Caused by: java.lang.ClassNotFoundException: org.apache.calcite.rel.type.RelDataTypeSystem
I looked at the logs in the Spark driver and noticed that Hudi is connecting to Hive. Hmm. Does anyone know how to force Hudi to use the AWS Glue Data Catalog?
HoodieSparkSqlWriter$: Syncing to Hive Metastore (URL: jdbc:hive2://localhost:10000)
I am tagging the related GitHub issue (AWS limited).
I see a small difference between these two combinations.
When using AWS Glue, it is enough to set "Use Glue Data Catalog as the Hive metastore". However, AWS EMR on EKS requires some extra configuration.
You can find more details here: https://aws.github.io/aws-emr-containers-best-practices/metastore-integrations/docs/aws-glue/#sync-hudi-table-with-aws-glue-catalog
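In short, the extra configuration is pointing Spark's Hive metastore client at Glue. A minimal sketch (the factory class is the one from the linked doc; the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Route Hive metastore calls to the AWS Glue Data Catalog.
val spark = SparkSession.builder()
  .appName("hudi-glue-sync") // placeholder
  .config("spark.hadoop.hive.metastore.client.factory.class",
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()
```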
Describe the problem you faced
Trying to run Hudi with Spark 3.0.0 and getting an error.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment Description
Hudi version : 0.5.3
Spark version : 3.0.0
Hive version : 2.3.7
Hadoop version : 3.2.0
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) : yes
Additional context
Add any other context about the problem here.
Stacktrace