apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hudi not working with Spark 3.0.0 #1751

Closed lyogev closed 3 years ago

lyogev commented 4 years ago

Describe the problem you faced

Trying to run Hudi with Spark 3.0.0 and getting an error.

To Reproduce

Steps to reproduce the behavior:

Expected behavior

Environment Description

- Hudi version : 0.5.3
- Spark version : 3.0.0
- Hive version : 2.3.7
- Hadoop version : 3.2.0
- Running on Docker? (yes/no) : yes

Additional context

Add any other context about the problem here.

Stacktrace

```
Caused by: java.lang.NoSuchMethodError: 'java.lang.Object org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(org.apache.spark.sql.catalyst.InternalRow)'
at org.apache.hudi.AvroConversionUtils$.$anonfun$createRdd$1(AvroConversionUtils.scala:42)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1423)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2133)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
```
bhasudha commented 4 years ago

@lyogev can you please paste the spark-shell or spark-submit command you used that would reproduce this?
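For reference, the usual shape of such a command per the Hudi quickstart (the bundle artifact and versions here are assumptions chosen to match the reported environment):

```
spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.5.3 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```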

cdmikechen commented 4 years ago

Spark 3.0 is a new version (just released on 2020-06-18). It is quite different from 2.x, so it is not surprising that there are some conflicts. At present, using Hudi with Spark 3.0 is not recommended.

vinothchandar commented 4 years ago

Let's wait for @lyogev to chime in. I think @n3nash did explicitly test Spark 3 and confirmed it working as of 0.5.1/0.5.2.

lyogev commented 4 years ago

Sorry for not responding sooner. I am working on this PR: https://github.com/YotpoLtd/metorikku/pull/335, which is currently failing in CI. This Spark PR (merged for version 3.0.0): https://github.com/apache/spark/pull/28223 moved fromRow out of ExpressionEncoder into another class, and fromRow is used by Hudi in https://github.com/apache/hudi/blob/89e37d5273ea1c6bf2fe3a8f7053e7a3cc44011d/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionUtils.scala#L42
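For context, a minimal sketch of the encoder API difference (illustrative: a simple Long encoder rather than Hudi's actual Avro conversion path):

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

object EncoderApiChange {
  def main(args: Array[String]): Unit = {
    // Resolve and bind an encoder for a simple type.
    val enc = Encoders.scalaLong.asInstanceOf[ExpressionEncoder[Long]].resolveAndBind()
    val row = InternalRow(42L)

    // Spark 2.4.x API (gone in 3.0.0, hence the NoSuchMethodError above):
    //   val value: Long = enc.fromRow(row)

    // Spark 3.0.0 API after apache/spark#28223:
    val value: Long = enc.createDeserializer().apply(row)
    println(value) // 42
  }
}
```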

vinothchandar commented 4 years ago

Does #1760 help? If you could try it out and report back, it'd be awesome.

lyogev commented 4 years ago

Looks like it does! Now I'm stuck at Hive sync:

```
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/calcite/rel/type/RelDataTypeSystem
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory.get(SemanticAnalyzerFactory.java:318)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:484)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:401)
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:384)
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:374)
    at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:266)
    at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:152)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:120)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:93)
    at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
    at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:286)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:189)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:944)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380)
```

I think Calcite was removed from Spark as a dependency.

vinothchandar commented 4 years ago

@lyogev could you try including it explicitly in the hudi-spark-bundle pom under packaging and give it a shot? (If not, I will make some time to try this later tonight or tomorrow morning and push a PR.)
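A sketch of what that could look like in the hudi-spark-bundle pom, assuming the calcite-core 1.16.0 coordinates that come up below (whether the bundle also needs a shade-plugin include is an assumption):

```xml
<!-- declare the dependency (coordinates from later in this thread) -->
<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
  <version>1.16.0</version>
</dependency>

<!-- and, if the bundle shades its dependencies, include it in the artifact set -->
<include>org.apache.calcite:calcite-core</include>
```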

bschell commented 4 years ago

I did not run into this when testing #1760 myself; I think it might be because we have internal changes for Hive 3.

I just checked, and it looks like we have Calcite added and shaded in hudi-hive-bundle:

```xml
<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
  <version>1.16.0</version>
</dependency>
```

I think you might also need to add and shade libfb303 for hive-sync:

```xml
<dependency>
  <groupId>org.apache.thrift</groupId>
  <artifactId>libfb303</artifactId>
  <version>0.9.3</version>
</dependency>
```

We made these dependency changes for Hive 3 compatibility; however, we didn't realize they were also needed for Spark 3. I will try to update #1760 with what is needed.

vinothchandar commented 4 years ago

Thanks @bschell.

shashwatsrivastava94 commented 4 years ago

Was wondering if there is an update here! Running a PoC and would love to use Hudi + Spark 3 if possible. Thanks!

bvaradar commented 4 years ago

@nsivabalan : Can you reply when you get a chance ?

Thanks, Balaji.V

nsivabalan commented 4 years ago

@bschell is driving this. Ref PR: https://github.com/apache/hudi/pull/1760. @bschell: any rough timeline?

nsivabalan commented 3 years ago

Spark 3 support is available in the latest release. Closing this for now. Let us know/reopen if you have any further asks.

dszakallas commented 3 years ago

I am getting the same java.lang.NoClassDefFoundError: org/apache/calcite/rel/type/RelDataTypeSystem when trying to use Hive sync. It looks like Spark 3 is using a custom class-loading mechanism for loading Hive: https://github.com/apache/spark/blob/ab93729987084ec55f762639a7e7f7cb8dd275e1/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L170 I think this might cause some classes to be loaded only if they come from one of the exec jars, which I assume are the jars shipped with Spark.

dszakallas commented 3 years ago

I resolved the issue by deleting these two exclusions from Spark: https://github.com/apache/spark/blob/v3.0.1/pom.xml#L1692-L1699. After that, calcite-core becomes part of the distribution, and I then get the following error:

```
SemanticException Cannot find class 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
org.apache.hadoop.hive.ql.parse.SemanticException: Cannot find class 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    at org.apache.hadoop.hive.ql.parse.ParseUtils.ensureClassExists(ParseUtils.java:263)
```

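For reference, the shape of the exclusions deleted above (a sketch only; the linked pom lines are authoritative, and the second artifact here is an assumption):

```xml
<exclusion>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
</exclusion>
<exclusion>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-avatica</artifactId>
</exclusion>
```
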
parisni commented 3 years ago

@dszakallas got the same behavior on Spark 3.0.2 / Hudi 0.8 (RelDataTypeSystem, then HoodieParquetInputFormat).

Fixed it by setting .option("hoodie.datasource.hive_sync.use_jdbc", "true"); there is then no need for the Calcite jars or anything else.

dude0001 commented 3 years ago

I'm getting the same java.lang.NoClassDefFoundError: org/apache/calcite/rel/type/RelDataTypeSystem error with Spark 3.1.1 and hudi-spark3-bundle_2.12-0.8.0.jar. I am unable to use .option("hoodie.datasource.hive_sync.use_jdbc", "true"). I see mention above of the missing calcite-core dependency, but I don't see any reference to that package in #1760 where Spark 3 support was added. Do I need to do something now to manually deploy the calcite-core dependency? If so, how do I know what version to pull from https://repo1.maven.org/maven2/org/apache/calcite/calcite-core? I see 1.16.0 referenced above, but I'm not clear on how to know that it supports my versions of Hudi, Spark, etc. @bschell @nsivabalan

parisni commented 3 years ago

@dude0001

.option("hoodie.datasource.hive_sync.use_jdbc", "false")
.option("hoodie.datasource.hive_sync.mode": "hms"),

did the trick for me
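For reference, a minimal sketch of where those options go on a Hudi write (field, table, bucket, and path names here are illustrative):

```scala
// df is an existing DataFrame to be written as a Hudi table
df.write
  .format("hudi")
  .option("hoodie.table.name", "my_table")                    // illustrative table name
  .option("hoodie.datasource.write.recordkey.field", "id")    // illustrative key field
  .option("hoodie.datasource.write.precombine.field", "ts")   // illustrative precombine field
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.use_jdbc", "false")
  .option("hoodie.datasource.hive_sync.mode", "hms")
  .mode("append")
  .save("s3://my-bucket/tables/my_table")                     // illustrative path
```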

Gatsby-Lee commented 3 years ago

For anyone who gets here due to the issue with AWS Glue 3.0 + Hudi Connector:

Status as of 2021-11-13: AWS Glue 3.0 fails to load the image from ECR for the Hudi Connector dependencies. If you don't need Hudi schema evolution, go with AWS Glue 2.0. If you do need Hudi schema evolution, you have to use AWS Glue 3.0; otherwise you will see the Glue Catalog issue caused by an outdated EMRFS.

You can still use AWS Glue 3.0 + Hudi by adding the Hudi JAR dependencies yourself instead of having the Glue Connector do it for you.

You need four dependencies.

I hope this can help someone. I was stuck on this for a day.

Gatsby-Lee commented 2 years ago

LOL. I never thought I would get help from my own comment that I left here.

Gatsby-Lee commented 2 years ago

OMG. I am here again. I am building an ETL pipeline on AWS EMR on EKS with Hudi.

The error I have:

```
Caused by: java.lang.ClassNotFoundException: org.apache.calcite.rel.type.RelDataTypeSystem
```

I looked at the logs in the Spark driver and noticed that Hudi is connecting to Hive. Hmm. Does anyone know how to force Hudi to use the AWS Glue Data Catalog?

```
HoodieSparkSqlWriter$: Syncing to Hive Metastore (URL: jdbc:hive2://localhost:10000)
```

Gatsby-Lee commented 2 years ago

I am tagging a related GitHub issue (AWS limited).

Gatsby-Lee commented 2 years ago

I see a small difference between these two combinations.

When using AWS Glue, it is enough to set "Use Glue Data Catalog as the Hive metastore". However, when using AWS EMR on EKS, it requires some extra configuration.

You can find more details here: https://aws.github.io/aws-emr-containers-best-practices/metastore-integrations/docs/aws-glue/#sync-hudi-table-with-aws-glue-catalog
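A sketch of the usual knob behind that setting, following the pattern in the linked docs (passed as a Spark conf; the factory class is the standard Glue Data Catalog Hive client integration):

```
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```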