@worf0815 can you confirm that you can use the default database from the Glue catalog with Hudi, and that switching databases is what is causing this?
https://github.com/localstack/localstack/issues/1031 seems to be related. I will try to reproduce the issue and see how we can relocate Jackson.
@nsivabalan I have verified that this issue is resolved in the 0.11 snapshot. Do you know if we have relocated Jackson recently?
@worf0815 can you try with Hudi 0.10.1 on EMR? Here is the command to launch pyspark:
pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
@rkkalluri I can confirm that with the above settings, using Hudi 0.10.1, everything works as expected :)
@worf0815 you mean it works well with Hudi 0.10.1?
Thanks @rkkalluri for helping out. @worf0815: will go ahead and close out the issue.
Yes, the Jackson issue was solved and it is now working as expected...
@nsivabalan @worf0815 will there be some way to make it work on older Hudi versions?
I'm working with Scala, so can it be solved by importing Jackson separately?
If you are using EMR, AWS Support recommended separately specifying the AWS SDK dependency, e.g. for pyspark (though the same should work for spark-submit as well):
pyspark --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/jars/spark-avro.jar
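The same flags should carry over to spark-submit; as a sketch (my_job.py is a hypothetical application script, and the jar paths assume the standard EMR locations from the command above):
spark-submit --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/jars/spark-avro.jar my_job.py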
Setting spark.driver.userClassPathFirst=true and spark.executor.userClassPathFirst=true will also hint Spark to prioritize the jars you provide via --jars.
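For example, combining this with the command above (a sketch only; the jar paths are the ones from the AWS Support recommendation):
pyspark --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --conf "spark.driver.userClassPathFirst=true" --conf "spark.executor.userClassPathFirst=true" --jars /usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/jars/spark-avro.jar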
@worf0815 so will it be solved by using the aws-java-sdk-bundle-1.12.31.jar, hudi-spark-bundle.jar, and spark-avro.jar jar files inside EMR?
I confirmed that adding aws-java-sdk-bundle-1.12.31.jar explicitly to --jars resolved this issue.
Describe the problem you faced
Running pyspark on an AWS EMR 6.5.0 cluster with Hudi enabled results in an exception when trying to access the Glue catalog.
To Reproduce
Steps to reproduce the behavior:
pyspark --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
spark.catalog.setCurrentDatabase("mydatabase")
java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;
is thrown.
Expected behavior
Without specifying any of the Hudi jars or options, pyspark is able to connect to the Glue catalog. This should also be possible with Hudi.
Environment Description
Hudi version : 0.9.0 (included in EMR 6.5.0)
Spark version : 3.1.2
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Used EMR 6.5.0 and started the pyspark shell according to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
Stacktrace
>>> spark.catalog.setCurrentDatabase("mydatabase")