apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] cannot assign instance of java.lang.invoke.SerializedLambda #8340

Closed TranHuyTiep closed 1 year ago

TranHuyTiep commented 1 year ago

Describe the problem you faced

Reading a Hudi table with PySpark works locally but fails on Kubernetes with a `cannot assign instance of java.lang.invoke.SerializedLambda` error during task deserialization.

Environment Description

- Hudi version: 0.13.0 (hudi-spark3.2-bundle_2.12, with spark-avro_2.12:3.3.1)
- Storage: HDFS
- Deployment: Kubernetes (Spark on k8s)

Additional context

KnightChess commented 1 year ago

Can you provide a simple reproduction step, like code or SQL?

TranHuyTiep commented 1 year ago

> Can you provide a simple reproduction step, like code or SQL?

Here is my code:

```python
# -*- coding: utf-8 -*-
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

if __name__ == '__main__':
    print("_____Run__")

    spark_conf = SparkConf()
    spark_conf.set("spark.jars.packages", "org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.0,org.apache.spark:spark-avro_2.12:3.3.1")
    spark_conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark_conf.set("spark.sql.hive.convertMetastoreParquet", "false")
    spark_conf.set("spark.rdd.compress", "true")

    sparkSession = (SparkSession
                    .builder
                    .appName('read_hudi')
                    .config(conf=spark_conf)
                    .getOrCreate())

    file_path = "hdfs://hdfs_host:9000/data/dwd/trans_event"

    # READ HUDI
    # (inferSchema/header are CSV reader options; the Hudi source ignores them)
    df_load = sparkSession.read.format("org.apache.hudi").load(
        file_path, inferSchema=True,
        header=True
    )

    # QUERY VIA TEMP VIEW
    df_load.createOrReplaceTempView("trans_event")
    query_e34_trans_raw = """
        SELECT transaction_data.profile_id FROM trans_event LIMIT 10
    """
    df_load_profile_trans = sparkSession.sql(sqlQuery=query_e34_trans_raw)
    df_load_profile_trans.printSchema()
    df_load_profile_trans.show(10)
    sparkSession.stop()

    print("_____________End______________")
```

KnightChess commented 1 year ago

Does it work locally, outside of k8s?

TranHuyTiep commented 1 year ago

> Does it work locally, outside of k8s?

Yes, it works locally when I set spark_conf.setMaster("local[*]"). With that setting it also runs on k8s, but then no executors are created and everything runs in a single driver.

KnightChess commented 1 year ago

Have you set the relevant configs in your submit command? Something like:

```sh
pyspark \
  --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```

Also, the environment can differ between submit modes, so I think you should confirm it. For example, if you run in cluster mode with a YARN archive, make sure the Hudi jar is inside the archive, and so on.
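On k8s specifically, one way to rule out dependency-resolution problems is to skip spark.jars.packages entirely and point at a bundle that already sits inside the container image. A minimal sketch, assuming cluster mode and a jar baked in at /opt/hudi; the master URL, image name, and paths are placeholders, not values from this thread:

```sh
# Sketch only: ship the Hudi bundle via an image-local path instead of
# spark.jars.packages, so driver and executor pods load the same jar.
spark-submit \
  --master k8s://https://<apiserver-host>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --jars local:///opt/hudi/hudi-spark3.2-bundle_2.12-0.13.0.jar \
  read_hudi.py
```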

ad1happy2go commented 1 year ago

@TranHuyTiep Were you able to resolve this issue, or are you still facing it?

codope commented 1 year ago

cc @harsh1231

bigdata-spec commented 1 year ago

@TranHuyTiep I have the same environment on k8s. How can I connect with you?

TranHuyTiep commented 1 year ago

> @TranHuyTiep Were you able to resolve this issue, or are you still facing it?

I solved it by building a new image and copying all the packages from .ivy2/jars/* into /opt/spark/jars/.
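A minimal sketch of that image fix, assuming the dependencies were first resolved into ~/.ivy2/jars by a local run of the job; the base image, registry, and tag below are placeholders, not values confirmed in this thread:

```sh
# Copy the ivy-resolved jars next to a Dockerfile, then bake them into
# Spark's default classpath so driver and executor pods all see them.
mkdir -p build/jars
cp ~/.ivy2/jars/*.jar build/jars/

cat > build/Dockerfile <<'EOF'
FROM <your-spark-base-image>
COPY jars/ /opt/spark/jars/
EOF

docker build -t <registry>/spark-hudi:0.13.0 build/
docker push <registry>/spark-hudi:0.13.0
```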

ad1happy2go commented 1 year ago

Thanks @TranHuyTiep. Closing the issue as you were able to fix it. Please reopen if you see the issue again.

parisni commented 4 months ago

> I solved it by building a new image and copying all the packages from .ivy2/jars/* into /opt/spark/jars/.

Same here on Kubernetes. It sounds like k8s does not work well with spark.jars.packages and Hudi.
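A quick way to tell whether a pod is actually picking up the baked-in jars is to list the image's default classpath directly; the pod name below is a placeholder:

```sh
# An empty result means the Hudi bundle is not on the pod's default
# classpath and the image-baking fix above has not taken effect.
kubectl exec <driver-or-executor-pod> -- ls /opt/spark/jars | grep -i hudi
```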