[Open] curio77 opened 4 weeks ago
It looks like a Scala version conflict.
Yes, but how do I resolve it? This happens when following that tutorial without any Scala-related adjustments. I've already tried alternative Scala versions by adding, e.g., -Dscala-2.11 to the mvn package command, but then the build itself fails.
@ad1happy2go Do you have some insights here?
@curio77 I am trying to set it up on my machine. Will update.
Confirmed.
albert@Alberts-MBP docker % docker exec -it adhoc-2 /bin/bash
root@adhoc-2:/opt# spark-submit \
> --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
> --table-type COPY_ON_WRITE \
> --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
> --source-ordering-field ts \
> --target-base-path /user/hive/warehouse/stock_ticks_cow \
> --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
> --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
24/08/26 21:12:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/08/26 21:12:04 WARN streamer.SchedulerConfGenerator: Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instantiation time. Continuing without scheduling configs
24/08/26 21:12:04 INFO spark.SparkContext: Running Spark version 2.4.4
24/08/26 21:12:04 INFO spark.SparkContext: Submitted application: streamer-stock_ticks_cow
24/08/26 21:12:04 INFO spark.SecurityManager: Changing view acls to: root
24/08/26 21:12:04 INFO spark.SecurityManager: Changing modify acls to: root
24/08/26 21:12:04 INFO spark.SecurityManager: Changing view acls groups to:
24/08/26 21:12:04 INFO spark.SecurityManager: Changing modify acls groups to:
24/08/26 21:12:04 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
24/08/26 21:12:04 INFO Configuration.deprecation: mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
24/08/26 21:12:04 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
24/08/26 21:12:04 INFO Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
24/08/26 21:12:04 INFO util.Utils: Successfully started service 'sparkDriver' on port 42985.
24/08/26 21:12:04 INFO spark.SparkEnv: Registering MapOutputTracker
24/08/26 21:12:04 INFO spark.SparkEnv: Registering BlockManagerMaster
24/08/26 21:12:04 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/08/26 21:12:04 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/08/26 21:12:04 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-f0cbf1d6-2863-462c-9d80-530876684a5a
24/08/26 21:12:04 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
24/08/26 21:12:04 INFO spark.SparkEnv: Registering OutputCommitCoordinator
24/08/26 21:12:04 INFO util.log: Logging initialized @736ms
24/08/26 21:12:04 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
24/08/26 21:12:04 INFO server.Server: Started @761ms
24/08/26 21:12:04 INFO server.AbstractConnector: Started ServerConnector@2c1dc8e{HTTP/1.1,[http/1.1]}{0.0.0.0:8090}
24/08/26 21:12:04 INFO util.Utils: Successfully started service 'SparkUI' on port 8090.
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5fa47fea{/jobs,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4e406694{/jobs/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5ab9b447{/jobs/job,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4f8caaf3{/jobs/job/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2b50150{/stages,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15b986cd{/stages/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6bb7cce7{/stages/stage,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@328572f0{/stages/stage/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@678040b3{/stages/pool,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@17f460bb{/stages/pool/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@64a1923a{/storage,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7d2a6eac{/storage/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@18ca3c62{/storage/rdd,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2c0f7678{/storage/rdd/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@44d70181{/environment,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6aa648b9{/environment/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@23c650a3{/executors,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@742d4e15{/executors/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@88a8218{/executors/threadDump,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@50b1f030{/executors/threadDump/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4163f1cd{/static,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b1637e1{/,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@18151a14{/api,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@ceb4bd2{/jobs/job/kill,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@60297f36{/stages/stage/kill,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://adhoc-2:8090
24/08/26 21:12:04 INFO spark.SparkContext: Added JAR file:/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar at spark://adhoc-2:42985/jars/hoodie-utilities.jar with timestamp 1724706724533
24/08/26 21:12:04 INFO executor.Executor: Starting executor ID driver on host localhost
24/08/26 21:12:04 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42359.
24/08/26 21:12:04 INFO netty.NettyBlockTransferService: Server created on adhoc-2:42359
24/08/26 21:12:04 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/08/26 21:12:04 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, adhoc-2, 42359, None)
24/08/26 21:12:04 INFO storage.BlockManagerMasterEndpoint: Registering block manager adhoc-2:42359 with 366.3 MB RAM, BlockManagerId(driver, adhoc-2, 42359, None)
24/08/26 21:12:04 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, adhoc-2, 42359, None)
24/08/26 21:12:04 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, adhoc-2, 42359, None)
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1dcca8d3{/metrics/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/08/26 21:12:04 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/08/26 21:12:05 INFO server.AbstractConnector: Stopped Spark@2c1dc8e{HTTP/1.1,[http/1.1]}{0.0.0.0:8090}
24/08/26 21:12:05 INFO ui.SparkUI: Stopped Spark web UI at http://adhoc-2:8090
24/08/26 21:12:05 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/08/26 21:12:05 INFO memory.MemoryStore: MemoryStore cleared
24/08/26 21:12:05 INFO storage.BlockManager: BlockManager stopped
24/08/26 21:12:05 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
24/08/26 21:12:05 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/08/26 21:12:05 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.NoSuchMethodError: scala.Function1.$init$(Lscala/Function1;)V
at org.apache.spark.sql.hudi.HoodieSparkSessionExtension.<init>(HoodieSparkSessionExtension.scala:28)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at org.apache.spark.sql.SparkSession$Builder.liftedTree1$1(SparkSession.scala:945)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
at org.apache.spark.sql.SQLContext$.getOrCreate(SQLContext.scala:1066)
at org.apache.spark.sql.SQLContext.getOrCreate(SQLContext.scala)
at org.apache.hudi.client.common.HoodieSparkEngineContext.<init>(HoodieSparkEngineContext.java:72)
at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:166)
at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:150)
at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:136)
at org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:606)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
24/08/26 21:12:05 INFO util.ShutdownHookManager: Shutdown hook called
24/08/26 21:12:05 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-015224c9-692d-47a5-b2e4-45d649ae189a
24/08/26 21:12:05 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-4159ae0c-5c40-4dc6-aaf1-b4d54f578ee4
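For context, the NoSuchMethodError on scala.Function1.$init$ above is the classic symptom of a Scala binary-version mismatch: the Hudi bundle was compiled for one Scala line (2.12) while the Spark 2.4.4 runtime ships another (2.11). A minimal sketch for checking both sides; the helper function and jar name are illustrative, relying only on the conventional _&lt;scala-version&gt; suffix in bundle jar names:

```shell
# Scala binary version a bundle jar was built for, parsed from the
# conventional _<scala-version> suffix in its filename:
scala_of_jar() {
  echo "$1" | sed -n 's/.*_\(2\.1[1-3]\)-.*/\1/p'
}

scala_of_jar "hudi-utilities-bundle_2.12-0.15.0.jar"   # -> 2.12

# Scala version of the Spark runtime (run inside the adhoc-2 container):
# spark-submit --version 2>&1 | grep -i scala
```

If the two versions differ, one side has to be rebuilt or swapped, which is what the rest of this thread works through.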
Trying to upgrade Spark from 2.4.4 to 2.4.8 using https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-without-hadoop-scala-2.12.tgz
Upgrading to 2.4.8 did not solve anything. It still hits the same Scala issue.
From support issue https://github.com/apache/hudi/issues/10262, it seems that to solve the Scala issue you have to compile Hudi with Scala 2.11. The problem is that Hudi 0.15 doesn't support Scala 2.11.
Compiling with Scala 2.12 doesn't work. Trying 2.13.
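For reference, Hudi selects the Scala line together with a Spark profile at build time. A sketch of the relevant invocations (profile flags per the Hudi repo README; which combinations actually exist depends on the checkout, and 0.15 has dropped Scala 2.11 as noted above):

```shell
# Build bundles for Spark 3.4 / Scala 2.12 (flags per the Hudi README):
mvn clean package -DskipTests -Dspark3.4 -Dscala-2.12

# A Scala 2.13 attempt would pair with a newer Spark profile, e.g. Spark 3.5;
# whether 2.13 is fully wired up depends on the Hudi version being built:
mvn clean package -DskipTests -Dspark3.5 -Dscala-2.13
```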
It looks like a new demo has to be built. All the libraries are too old.
While the new demo is being built, you can use https://github.com/alberttwong/onehouse-demos/tree/main/trino-prestodb-spark-minio
Thank you, Albert, for your efforts. I'll look into the link you posted!
Making changes to run_sync_tool.sh to make it easier to use: https://github.com/apache/hudi/pull/11848
Raw updated commands:
https://github.com/apache/spark/blob/v3.4.3/pom.xml#L122: hadoop 3.3.4, hive 2.3.9
https://github.com/apache/spark/blob/v3.5.2/pom.xml#L125: hadoop 3.3.4, hive 2.3.9
https://github.com/apache/hudi/blob/master/pom.xml#L187
//Run the Kafka client and Spark in the spark container
docker exec -it spark /bin/bash
//add messages to topic
cat /opt/demo/data/batch_1.json | kafkacat -b kafka:9092 -t stock_ticks -P
//check topic
kafkacat -b kafka -L -J | jq .
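As an optional sanity check after producing, the batch can be counted back out of the topic (kafkacat flags: -C consume, -e exit at end of topic, -q suppress noise):

```shell
# Count the JSON records that actually landed in the stock_ticks topic;
# the count should match the number of lines in batch_1.json:
kafkacat -b kafka:9092 -t stock_ticks -C -e -q | wc -l
```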
//create COW table on S3
spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.15.0,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
--class org.apache.hudi.utilities.streamer.HoodieStreamer org.apache.hudi_hudi-utilities-slim-bundle_2.12-0.15.0.jar \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path s3a://warehouse/stock_ticks_cow \
--target-table stock_ticks_cow \
--props file:///opt/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
//create MOR table on S3
spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.15.0,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
--class org.apache.hudi.utilities.streamer.HoodieStreamer org.apache.hudi_hudi-utilities-slim-bundle_2.12-0.15.0.jar \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path s3a://warehouse/stock_ticks_mor \
--target-table stock_ticks_mor \
--props file:///opt/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--disable-compaction
// Need hadoop-aws 3.3.4 to make Spark work, but Hudi sync needs hadoop-aws 2.10.2
org.apache.thrift:libthrift:0.13.0,org.apache.hadoop:hadoop-aws:2.10.2
//Run hudi sync in openjdk8 container
docker exec -it openjdk8 /bin/bash
/opt/hudi/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--metastore-uris 'thrift://hive-metastore:9083' \
--partitioned-by dt \
--base-path 's3a://warehouse/stock_ticks_cow' \
--database default \
--table stock_ticks_cow \
--sync-mode hms \
--partition-value-extractor org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
/opt/hudi/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--metastore-uris 'thrift://hive-metastore:9083' \
--partitioned-by dt \
--base-path 's3a://warehouse/stock_ticks_mor' \
--database default \
--table stock_ticks_mor \
--sync-mode hms \
--partition-value-extractor org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
//Open a shell in the spark container
docker exec -it spark /bin/bash
// Run SQL queries
spark-sql --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.15.0,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
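Inside that spark-sql session, the demo's usual sanity query can be run against the synced COW table. A non-interactive sketch (same --packages and --conf as the session above; a subset is repeated here, and the query mirrors the original demo flow):

```shell
# Query the COW table written by the HoodieStreamer run above; expect one
# row per symbol with the latest ingested timestamp:
spark-sql \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  -e "select symbol, max(ts) from stock_ticks_cow group by symbol"
```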
We are testing with new instructions.
https://github.com/alberttwong/onehouse-demos/tree/main/hudi-spark-minio-trino
Describe the problem you faced
Following the official Docker demo tutorial (here), I get an error at step 2 when executing a command inside one of the containers.
To Reproduce
Steps to reproduce the behavior:
1. Check out release-0.15.0.
2. Use a v1.8 JDK (I've used OpenJDK v1.8.0_422). Build with just mvn clean package -Pintegration-tests -DskipTests.
3. The build should complete without errors.
4. Run the step-2 command inside the (adhoc-2) Docker container. This throws a Scala exception.
Expected behavior
I expect that not to throw an exception.
Environment Description
Hudi version : 0.15.0
Spark version : 3.5
Hive version : unsure
Hadoop version : unsure
Storage (HDFS/S3/GCS..) : irrelevant
Running on Docker? (yes/no) : yes
Additional context
Stacktrace
Full output: