apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Docker demo on website does not work as expected #11797

Open curio77 opened 4 weeks ago

curio77 commented 4 weeks ago

Describe the problem you faced

While following the official Docker demo tutorial on the website, I get an error at Step 2 when executing a command inside one of the containers.

To Reproduce

Steps to reproduce the behavior:

  1. Follow the tutorial up to Step 2: clone the Hudi repo at tag release-0.15.0, use a 1.8 JDK (I used OpenJDK 1.8.0_422), and build with just mvn clean package -Pintegration-tests -DskipTests. The build completes without errors.
  2. Run the first command of Step 2 inside the adhoc-2 Docker container. This throws a Scala exception.

Expected behavior

I expect the command to complete without throwing an exception.

Environment Description

Additional context

Stacktrace

Exception in thread "main" java.lang.NoSuchMethodError: scala.Function1.$init$(Lscala/Function1;)V
        at org.apache.spark.sql.hudi.HoodieSparkSessionExtension.<init>(HoodieSparkSessionExtension.scala:28)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at org.apache.spark.sql.SparkSession$Builder.liftedTree1$1(SparkSession.scala:945)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
        at org.apache.spark.sql.SQLContext$.getOrCreate(SQLContext.scala:1066)
        at org.apache.spark.sql.SQLContext.getOrCreate(SQLContext.scala)
        at org.apache.hudi.client.common.HoodieSparkEngineContext.<init>(HoodieSparkEngineContext.java:72)
        at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:166)
        at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:150)
        at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:136)
        at org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:606)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Full output:

root@adhoc-2:/opt# spark-submit \
>   --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
>   --table-type COPY_ON_WRITE \
>   --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
>   --source-ordering-field ts  \
>   --target-base-path /user/hive/warehouse/stock_ticks_cow \
>   --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
>   --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
24/08/19 14:50:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/08/19 14:50:38 WARN streamer.SchedulerConfGenerator: Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instantiation time. Continuing without scheduling configs
24/08/19 14:50:38 INFO spark.SparkContext: Running Spark version 2.4.4
24/08/19 14:50:38 INFO spark.SparkContext: Submitted application: streamer-stock_ticks_cow
24/08/19 14:50:38 INFO spark.SecurityManager: Changing view acls to: root
24/08/19 14:50:38 INFO spark.SecurityManager: Changing modify acls to: root
24/08/19 14:50:38 INFO spark.SecurityManager: Changing view acls groups to: 
24/08/19 14:50:38 INFO spark.SecurityManager: Changing modify acls groups to: 
24/08/19 14:50:38 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
24/08/19 14:50:38 INFO Configuration.deprecation: mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
24/08/19 14:50:38 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
24/08/19 14:50:38 INFO Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
24/08/19 14:50:39 INFO util.Utils: Successfully started service 'sparkDriver' on port 35325.
24/08/19 14:50:39 INFO spark.SparkEnv: Registering MapOutputTracker
24/08/19 14:50:39 INFO spark.SparkEnv: Registering BlockManagerMaster
24/08/19 14:50:39 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/08/19 14:50:39 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/08/19 14:50:39 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-6e3a4f57-0460-41ca-a384-2c35b2906e4c
24/08/19 14:50:39 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
24/08/19 14:50:39 INFO spark.SparkEnv: Registering OutputCommitCoordinator
24/08/19 14:50:39 INFO util.log: Logging initialized @1449ms
24/08/19 14:50:39 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
24/08/19 14:50:39 INFO server.Server: Started @1492ms
24/08/19 14:50:39 INFO server.AbstractConnector: Started ServerConnector@44ea608c{HTTP/1.1,[http/1.1]}{0.0.0.0:8090}
24/08/19 14:50:39 INFO util.Utils: Successfully started service 'SparkUI' on port 8090.
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3f3ddbd9{/jobs,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@62b3df3a{/jobs/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@420745d7{/jobs/job,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5fa47fea{/jobs/job/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2392212b{/stages,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b43e173{/stages/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@28f8e165{/stages/stage,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@22fa55b2{/stages/stage/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4d666b41{/stages/pool,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6594402a{/stages/pool/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@30f4b1a6{/storage,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@405325cf{/storage/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3e1162e7{/storage/rdd,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@79c3f01f{/storage/rdd/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6c2f1700{/environment,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@350b3a17{/environment/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@38600b{/executors,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@669d2b1b{/executors/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@721eb7df{/executors/threadDump,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1ea9f009{/executors/threadDump/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5d52e3ef{/static,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2c0f7678{/,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@44d70181{/api,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@88a8218{/jobs/job/kill,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@50b1f030{/stages/stage/kill,null,AVAILABLE,@Spark}
24/08/19 14:50:39 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://adhoc-2:8090
24/08/19 14:50:39 INFO spark.SparkContext: Added JAR file:/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar at spark://adhoc-2:35325/jars/hoodie-utilities.jar with timestamp 1724079039233
24/08/19 14:50:39 INFO executor.Executor: Starting executor ID driver on host localhost
24/08/19 14:50:39 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38309.
24/08/19 14:50:39 INFO netty.NettyBlockTransferService: Server created on adhoc-2:38309
24/08/19 14:50:39 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/08/19 14:50:39 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, adhoc-2, 38309, None)
24/08/19 14:50:39 INFO storage.BlockManagerMasterEndpoint: Registering block manager adhoc-2:38309 with 366.3 MB RAM, BlockManagerId(driver, adhoc-2, 38309, None)
24/08/19 14:50:39 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, adhoc-2, 38309, None)
24/08/19 14:50:39 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, adhoc-2, 38309, None)
24/08/19 14:50:39 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@14fc5d40{/metrics/json,null,AVAILABLE,@Spark}
24/08/19 14:50:39 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/08/19 14:50:39 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/08/19 14:50:39 INFO server.AbstractConnector: Stopped Spark@44ea608c{HTTP/1.1,[http/1.1]}{0.0.0.0:8090}
24/08/19 14:50:39 INFO ui.SparkUI: Stopped Spark web UI at http://adhoc-2:8090
24/08/19 14:50:39 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/08/19 14:50:39 INFO memory.MemoryStore: MemoryStore cleared
24/08/19 14:50:39 INFO storage.BlockManager: BlockManager stopped
24/08/19 14:50:39 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
24/08/19 14:50:39 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/08/19 14:50:39 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.NoSuchMethodError: scala.Function1.$init$(Lscala/Function1;)V
        at org.apache.spark.sql.hudi.HoodieSparkSessionExtension.<init>(HoodieSparkSessionExtension.scala:28)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at org.apache.spark.sql.SparkSession$Builder.liftedTree1$1(SparkSession.scala:945)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
        at org.apache.spark.sql.SQLContext$.getOrCreate(SQLContext.scala:1066)
        at org.apache.spark.sql.SQLContext.getOrCreate(SQLContext.scala)
        at org.apache.hudi.client.common.HoodieSparkEngineContext.<init>(HoodieSparkEngineContext.java:72)
        at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:166)
        at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:150)
        at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:136)
        at org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:606)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
24/08/19 14:50:39 INFO util.ShutdownHookManager: Shutdown hook called
24/08/19 14:50:39 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-f23d0542-c7eb-4a7e-9f50-e0f6fcfd5722
24/08/19 14:50:39 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-7911bc3e-ab03-479f-b199-af17b548d6e7
danny0405 commented 4 weeks ago

It looks like a Scala version conflict.
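For context, NoSuchMethodError: scala.Function1.$init$(Lscala/Function1;)V is the classic symptom of classes compiled for Scala 2.12 (whose trait encoding introduced the static $init$ methods) being loaded on a Scala 2.11 runtime. A hedged way to check both sides, assuming the Hudi pom exposes its Scala version as the scala.binary.version property:

# Inside the adhoc-2 container: the version banner reports the Scala
# version the installed Spark 2.4.4 was built with (expected: 2.11.x).
spark-submit --version

# In the Hudi checkout on the host: print the Scala binary version the
# build defaults to (expected: 2.12 for release-0.15.0).
mvn help:evaluate -Dexpression=scala.binary.version -q -DforceStdout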

curio77 commented 4 weeks ago

Yes, but how do I resolve it? This happens when following the tutorial without any Scala-related adjustments. I've already tried alternative Scala versions by adding, e.g., -Dscala-2.11 to the mvn package command, but then the build itself fails.

danny0405 commented 4 weeks ago

@ad1happy2go Do you have some insights here?

ad1happy2go commented 3 weeks ago

@curio77 I am trying to set this up on my machine. Will update.

alberttwong commented 3 weeks ago

Related: https://github.com/apache/hudi/issues/11826

alberttwong commented 3 weeks ago

Confirmed; same failure on my machine:

albert@Alberts-MBP docker % docker exec -it adhoc-2 /bin/bash
root@adhoc-2:/opt# spark-submit \
>   --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
>   --table-type COPY_ON_WRITE \
>   --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
>   --source-ordering-field ts  \
>   --target-base-path /user/hive/warehouse/stock_ticks_cow \
>   --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
>   --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
24/08/26 21:12:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/08/26 21:12:04 WARN streamer.SchedulerConfGenerator: Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instantiation time. Continuing without scheduling configs
24/08/26 21:12:04 INFO spark.SparkContext: Running Spark version 2.4.4
24/08/26 21:12:04 INFO spark.SparkContext: Submitted application: streamer-stock_ticks_cow
24/08/26 21:12:04 INFO spark.SecurityManager: Changing view acls to: root
24/08/26 21:12:04 INFO spark.SecurityManager: Changing modify acls to: root
24/08/26 21:12:04 INFO spark.SecurityManager: Changing view acls groups to:
24/08/26 21:12:04 INFO spark.SecurityManager: Changing modify acls groups to:
24/08/26 21:12:04 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
24/08/26 21:12:04 INFO Configuration.deprecation: mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
24/08/26 21:12:04 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
24/08/26 21:12:04 INFO Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
24/08/26 21:12:04 INFO util.Utils: Successfully started service 'sparkDriver' on port 42985.
24/08/26 21:12:04 INFO spark.SparkEnv: Registering MapOutputTracker
24/08/26 21:12:04 INFO spark.SparkEnv: Registering BlockManagerMaster
24/08/26 21:12:04 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/08/26 21:12:04 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/08/26 21:12:04 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-f0cbf1d6-2863-462c-9d80-530876684a5a
24/08/26 21:12:04 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
24/08/26 21:12:04 INFO spark.SparkEnv: Registering OutputCommitCoordinator
24/08/26 21:12:04 INFO util.log: Logging initialized @736ms
24/08/26 21:12:04 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
24/08/26 21:12:04 INFO server.Server: Started @761ms
24/08/26 21:12:04 INFO server.AbstractConnector: Started ServerConnector@2c1dc8e{HTTP/1.1,[http/1.1]}{0.0.0.0:8090}
24/08/26 21:12:04 INFO util.Utils: Successfully started service 'SparkUI' on port 8090.
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5fa47fea{/jobs,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4e406694{/jobs/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5ab9b447{/jobs/job,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4f8caaf3{/jobs/job/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2b50150{/stages,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15b986cd{/stages/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6bb7cce7{/stages/stage,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@328572f0{/stages/stage/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@678040b3{/stages/pool,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@17f460bb{/stages/pool/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@64a1923a{/storage,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7d2a6eac{/storage/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@18ca3c62{/storage/rdd,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2c0f7678{/storage/rdd/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@44d70181{/environment,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6aa648b9{/environment/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@23c650a3{/executors,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@742d4e15{/executors/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@88a8218{/executors/threadDump,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@50b1f030{/executors/threadDump/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4163f1cd{/static,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b1637e1{/,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@18151a14{/api,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@ceb4bd2{/jobs/job/kill,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@60297f36{/stages/stage/kill,null,AVAILABLE,@Spark}
24/08/26 21:12:04 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://adhoc-2:8090
24/08/26 21:12:04 INFO spark.SparkContext: Added JAR file:/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar at spark://adhoc-2:42985/jars/hoodie-utilities.jar with timestamp 1724706724533
24/08/26 21:12:04 INFO executor.Executor: Starting executor ID driver on host localhost
24/08/26 21:12:04 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42359.
24/08/26 21:12:04 INFO netty.NettyBlockTransferService: Server created on adhoc-2:42359
24/08/26 21:12:04 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/08/26 21:12:04 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, adhoc-2, 42359, None)
24/08/26 21:12:04 INFO storage.BlockManagerMasterEndpoint: Registering block manager adhoc-2:42359 with 366.3 MB RAM, BlockManagerId(driver, adhoc-2, 42359, None)
24/08/26 21:12:04 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, adhoc-2, 42359, None)
24/08/26 21:12:04 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, adhoc-2, 42359, None)
24/08/26 21:12:04 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1dcca8d3{/metrics/json,null,AVAILABLE,@Spark}
24/08/26 21:12:04 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/08/26 21:12:04 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/08/26 21:12:05 INFO server.AbstractConnector: Stopped Spark@2c1dc8e{HTTP/1.1,[http/1.1]}{0.0.0.0:8090}
24/08/26 21:12:05 INFO ui.SparkUI: Stopped Spark web UI at http://adhoc-2:8090
24/08/26 21:12:05 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/08/26 21:12:05 INFO memory.MemoryStore: MemoryStore cleared
24/08/26 21:12:05 INFO storage.BlockManager: BlockManager stopped
24/08/26 21:12:05 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
24/08/26 21:12:05 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/08/26 21:12:05 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.NoSuchMethodError: scala.Function1.$init$(Lscala/Function1;)V
    at org.apache.spark.sql.hudi.HoodieSparkSessionExtension.<init>(HoodieSparkSessionExtension.scala:28)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at org.apache.spark.sql.SparkSession$Builder.liftedTree1$1(SparkSession.scala:945)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
    at org.apache.spark.sql.SQLContext$.getOrCreate(SQLContext.scala:1066)
    at org.apache.spark.sql.SQLContext.getOrCreate(SQLContext.scala)
    at org.apache.hudi.client.common.HoodieSparkEngineContext.<init>(HoodieSparkEngineContext.java:72)
    at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:166)
    at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:150)
    at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:136)
    at org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:606)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
24/08/26 21:12:05 INFO util.ShutdownHookManager: Shutdown hook called
24/08/26 21:12:05 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-015224c9-692d-47a5-b2e4-45d649ae189a
24/08/26 21:12:05 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-4159ae0c-5c40-4dc6-aaf1-b4d54f578ee4
alberttwong commented 3 weeks ago

Trying to upgrade Spark from 2.4.4 to 2.4.8 using https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-without-hadoop-scala-2.12.tgz

alberttwong commented 3 weeks ago

Per https://github.com/apache/hudi/blob/e0ef86421993c1664da5cbb3ab7de7e87f16cb49/docker/hoodie/hadoop/spark_base/Dockerfile#L37, I'm supposed to use https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz

alberttwong commented 3 weeks ago

Upgrading to 2.4.8 did not solve anything; the same Scala issue persists.

From support issue https://github.com/apache/hudi/issues/10262, it seems that resolving the Scala issue requires compiling Hudi with Scala 2.11. The problem is that Hudi 0.15 no longer supports Scala 2.11.
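To spell the bind out (a hedged summary of the findings above): the demo images run Spark 2.4.4 on Scala 2.11, Hudi 0.15.0 builds Scala 2.12 artifacts by default, and the 2.11 build profile is gone, so both build variants fail one way or the other:

# Fails at build time: the Scala 2.11 profile is no longer supported in 0.15
mvn clean package -Pintegration-tests -DskipTests -Dscala-2.11

# Builds fine, but yields Scala 2.12 bundles that the demo's
# Spark 2.4.4 / Scala 2.11 runtime cannot load (the NoSuchMethodError above)
mvn clean package -Pintegration-tests -DskipTests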

alberttwong commented 3 weeks ago

Compiling with Scala 2.12 doesn't work. Trying 2.13.

alberttwong commented 3 weeks ago

It looks like a new demo has to be built. All the libraries are too old.

alberttwong commented 3 weeks ago

While the new demo is being built, you can use https://github.com/alberttwong/onehouse-demos/tree/main/trino-prestodb-spark-minio

curio77 commented 3 weeks ago

Thank you, Albert, for your efforts. I'll look into the link you posted!

alberttwong commented 3 weeks ago

https://github.com/apache/hudi/issues/11841

alberttwong commented 2 weeks ago

Making changes to run_sync_tool.sh to make it easier to use: https://github.com/apache/hudi/pull/11848

alberttwong commented 2 weeks ago

Raw updated commands. Dependency versions are taken from:

https://github.com/apache/spark/blob/v3.4.3/pom.xml#L122: Hadoop 3.3.4, Hive 2.3.9
https://github.com/apache/spark/blob/v3.5.2/pom.xml#L125: Hadoop 3.3.4, Hive 2.3.9
https://github.com/apache/hudi/blob/master/pom.xml#L187

# Run the kafka client and spark-submit inside the spark container
docker exec -it spark /bin/bash

# Add messages to the topic
cat /opt/demo/data/batch_1.json | kafkacat -b kafka:9092 -t stock_ticks -P

# Check the topic
kafkacat -b kafka -L -J | jq .

# Create the COW table on S3
spark-submit \
  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.15.0,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer org.apache.hudi_hudi-utilities-slim-bundle_2.12-0.15.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts  \
  --target-base-path s3a://warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow \
  --props file:///opt/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

# Create the MOR table on S3
spark-submit \
  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.15.0,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer org.apache.hudi_hudi-utilities-slim-bundle_2.12-0.15.0.jar \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path s3a://warehouse/stock_ticks_mor \
  --target-table stock_ticks_mor \
  --props file:///opt/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --disable-compaction
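A hedged aside on configuration: both spark-submit jobs above write to s3a:// paths, so they assume the spark container is already configured for the MinIO backend. If not, the standard Hadoop S3A properties can be passed as extra Spark conf overrides; the endpoint and credentials below are hypothetical placeholders, not values from this compose setup:

# Hypothetical S3A settings for a MinIO backend; append as extra --conf
# flags to either spark-submit invocation above:
--conf spark.hadoop.fs.s3a.endpoint=http://minio:9000
--conf spark.hadoop.fs.s3a.access.key=admin
--conf spark.hadoop.fs.s3a.secret.key=password
--conf spark.hadoop.fs.s3a.path.style.access=true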

# Note: Spark needs hadoop-aws 3.3.4, but the Hudi Hive sync needs hadoop-aws 2.10.2:
org.apache.thrift:libthrift:0.13.0,org.apache.hadoop:hadoop-aws:2.10.2

# Run the Hudi Hive sync in the openjdk8 container
docker exec -it openjdk8 /bin/bash

/opt/hudi/hudi-sync/hudi-hive-sync/run_sync_tool.sh  \
--metastore-uris 'thrift://hive-metastore:9083' \
--partitioned-by dt \
--base-path 's3a://warehouse/stock_ticks_cow' \
--database default \
--table stock_ticks_cow \
--sync-mode hms \
--partition-value-extractor org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor

/opt/hudi/hudi-sync/hudi-hive-sync/run_sync_tool.sh  \
--metastore-uris 'thrift://hive-metastore:9083' \
--partitioned-by dt \
--base-path 's3a://warehouse/stock_ticks_mor' \
--database default \
--table stock_ticks_mor \
--sync-mode hms \
--partition-value-extractor org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor

# Return to the spark container
docker exec -it spark /bin/bash

# Run SQL queries
spark-sql \
  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.15.0,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.15.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
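Once the spark-sql shell is up, queries along the lines of the original demo's query step should work. A hedged example (table names assume the Hive sync commands above succeeded; the MOR sync typically registers separate _ro and _rt views):

spark-sql> show tables;
spark-sql> select symbol, max(ts) from stock_ticks_cow group by symbol;
spark-sql> select symbol, max(ts) from stock_ticks_mor_ro group by symbol;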
alberttwong commented 1 week ago

We are testing the new instructions.

https://github.com/alberttwong/onehouse-demos/tree/main/hudi-spark-minio-trino