apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hive meta sync null pointer issue #11955

Open bytesid19 opened 3 days ago

bytesid19 commented 3 days ago


Describe the problem you faced

We recently migrated our Hudi version from 0.10.1 to 0.15.0 in our Spark jobs. The Hudi options at that time were:

'hoodie.table.name': 'table_name',
'hoodie.datasource.write.table.name':'table_name',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.recordkey.field': hudi_record_key,
'hoodie.datasource.write.partitionpath.field': "txn_date",
'hoodie.datasource.write.precombine.field': 'ingestion_timestamp',
'hoodie.upsert.shuffle.parallelism': 100,
'hoodie.insert.shuffle.parallelism': 100

At that time, Hive meta sync was not enabled.
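A minimal sketch of how the jobs write with these options (the DataFrame, source data, and base path are illustrative placeholders, not our exact code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hudi_options is the dict of write options listed above; the source data and
# the table base path below are illustrative placeholders.
df = spark.read.parquet("s3://bucket/incoming-batch/")
(df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/path/to/table"))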

Now we have a requirement to sync data to Hive via HMS mode, for which we added the following options to the above:

'hoodie.datasource.meta.sync.enable': 'true',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.database.name': '##db_name##',
'hoodie.datasource.hive_sync.metastore.uris': '##hive_uri##'
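Roughly, these are merged into the same options dict used by the write sketched above (placeholder values remain placeholders):

# Hive meta sync options merged into the existing write options.
hudi_options.update({
    'hoodie.datasource.meta.sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.database.name': '##db_name##',
    'hoodie.datasource.hive_sync.metastore.uris': '##hive_uri##',
})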

When we then triggered a new Spark job to write data to the same Hudi table, the job ran successfully and added the new data to the table, but did not update hoodie.properties. Now, whenever we trigger the Spark job to write new data, the job fails with a NullPointerException:

at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$null$5(TimelineUtils.java:114)
at java.util.HashMap.forEach(HashMap.java:1290)
at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getDroppedPartitions$6(TimelineUtils.java:113)

To Reproduce

Steps to reproduce the behavior:

  1. Migrate the Hudi version from 0.10.1 to 0.15.0 and start writing to the Hudi table without Hive sync.
  2. Add the Hive meta sync properties to the Hudi options.
  3. Write to the Hudi table via Spark jobs.

Expected behavior

Environment Description

Additional context


Stacktrace


: org.apache.hudi.exception.HoodieMetaSyncException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
    at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:81)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:1015)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.metaSync(HoodieSparkSqlWriter.scala:1013)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1112)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:508)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:49)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.$anonfun$sideEffectResult$1(commands.scala:82)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:79)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$3(QueryExecution.scala:256)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:165)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$2(QueryExecution.scala:256)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$9(SQLExecution.scala:258)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:448)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:203)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1073)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:131)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:398)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:255)
    at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$withMVTagsIfNecessary(QueryExecution.scala:238)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:251)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:244)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:106)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:339)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:335)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:244)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:395)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:244)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:198)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:189)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:305)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:964)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:429)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:396)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:250)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:306)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing LH820YOhaC01kt
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:170)
    at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:79)
    ... 58 more
Caused by: java.lang.NullPointerException
    at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$null$5(TimelineUtils.java:114)
    at java.util.HashMap.forEach(HashMap.java:1290)
    at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getDroppedPartitions$6(TimelineUtils.java:113)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
    at org.apache.hudi.common.table.timeline.TimelineUtils.getDroppedPartitions(TimelineUtils.java:110)
    at org.apache.hudi.sync.common.HoodieSyncClient.getDroppedPartitionsSince(HoodieSyncClient.java:97)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:289)
    at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:179)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:167)
    ... 59 more
danny0405 commented 3 days ago

It looks like the clean instant is missing:

          try {
            HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(cleanerTimeline.getInstantDetails(instant).get());
            cleanMetadata.getPartitionMetadata().forEach((partition, partitionMetadata) -> {
              if (partitionMetadata.getIsPartitionDeleted()) {
                partitionToLatestDeleteTimestamp.put(partition, instant.getTimestamp());
              }
            });
          }
bytesid19 commented 2 days ago

@danny0405 What is the above code doing, and how will it help solve our use case?

danny0405 commented 2 days ago

Did you check your clean metadata file and see what the partition metadata looks like?
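For example, one way to dump it (a rough sketch, assuming the completed .clean instant is an Avro container file as written by Hudi's timeline utilities; the local filename is illustrative):

# Rough sketch: print the partition metadata recorded in a completed .clean
# instant file that has been copied locally (filename is illustrative).
from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(open("20240903120440615.clean", "rb"), DatumReader())
for record in reader:
    for partition, meta in record["partitionMetadata"].items():
        print(partition, meta.get("isPartitionDeleted"))
reader.close()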

bytesid19 commented 2 days ago

@danny0405 Before that, I want to understand what you think might have happened that led you to propose the above. Why is Hive sync throwing a NullPointerException in this case?

It would have worked fine, without exceptions, had we migrated directly from 0.10.1 to 0.15.0 with Hive sync enabled in the Hudi options from the start.

But we initially migrated from 0.10.1 to 0.15.0 without Hive sync, and only then enabled it in the Hudi options.

Can you explain what difference would have led to this issue?

bytesid19 commented 2 days ago

Also, my partition metadata in the .clean file looks like:

{
   "2023-11-02":{
      "partitionPath":"2023-11-02",
      "policy":"KEEP_LATEST_COMMITS",
      "deletePathPatterns":[
         "e3af0439-4282-4505-88f9-0c70b39c8882-0_0-189-1130_20231126175139078.parquet"
      ],
      "successDeleteFiles":[
         "e3af0439-4282-4505-88f9-0c70b39c8882-0_0-189-1130_20231126175139078.parquet"
      ],
      "failedDeleteFiles":[

      ],
      "isPartitionDeleted":false
   },
   "2023-11-03":{
      "partitionPath":"2023-11-03",
      "policy":"KEEP_LATEST_COMMITS",
      "deletePathPatterns":[
         "bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-53-1247_20231126174353127.parquet",
         "bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-189-1104_20231126172830225.parquet",
         "bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-245-1376_20231126173720486.parquet"
      ],
      "successDeleteFiles":[
         "bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-53-1247_20231126174353127.parquet",
         "bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-189-1104_20231126172830225.parquet",
         "bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-245-1376_20231126173720486.parquet"
      ],
      "failedDeleteFiles":[

      ],
      "isPartitionDeleted":false
   }
}
danny0405 commented 2 days ago

I have no idea why the Hive sync could affect hoodie.properties; did you diff the hoodie.properties between the two kinds of operations?

Why is hive sync throwing null pointer in this case

Just from the error stacktrace, it looks like it is related to the clean metadata.

bytesid19 commented 1 day ago

Hudi options used while writing to the Hudi table when we migrated from 0.10.1 to 0.15.0:

'hoodie.table.name': 'table_name',
'hoodie.datasource.write.table.name':'table_name',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.recordkey.field': hudi_record_key,
'hoodie.datasource.write.partitionpath.field': "txn_date",
'hoodie.datasource.write.precombine.field': 'ingestion_timestamp',
'hoodie.upsert.shuffle.parallelism': 100,
'hoodie.insert.shuffle.parallelism': 100

hoodie.properties after that migration:

#Updated at 2024-09-03T12:04:40.615Z
#Tue Sep 03 12:04:40 UTC 2024
hoodie.table.precombine.field=ingestion_timestamp
hoodie.table.partition.fields=txn_date
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.populate.meta.fields=true
hoodie.timeline.layout.version=1
hoodie.table.version=6
hoodie.table.metadata.partitions=files
hoodie.table.recordkey.fields=record_id
hoodie.table.base.file.format=PARQUET
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.table.name=table_name
hoodie.table.metadata.partitions.inflight=
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.checksum=849148821

Extra Hudi options added later for Hive sync:

'hoodie.datasource.meta.sync.enable': 'true',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.database.name': '##db_name##',
'hoodie.datasource.hive_sync.metastore.uris': '##hive_uri##'

After adding the above options, hoodie.properties never changed.

bytesid19 commented 1 day ago

@danny0405 We are also trying the Hive sync tool, but that is also failing:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1380)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1361)
        at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:119)
        at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:108)
        at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:547)

when running

./run_sync_tool.sh  --sync-mode hms --metastore-uris uri:9083/qbdb --partitioned-by txn_date --base-path s3://abc/bytesid/table --database test --table table

Hadoop version: 3.3.0, Hive version: 3.1.3.

Please help here.

danny0405 commented 1 day ago

Exception in thread "main" java.lang.NoSuchMethodError

This is a hadoop jar conflict.
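One way to confirm which jars pull in the conflicting Guava class (a quick sketch; the directories are illustrative, point it at wherever your Hadoop and Hive jars live):

# Quick sketch: list every jar that bundles com.google.common.base.Preconditions,
# to spot the Guava version conflict behind the NoSuchMethodError.
import glob
import zipfile

search_dirs = ["/opt/hadoop/share/hadoop/common/lib", "/opt/hive/lib"]  # illustrative
for d in search_dirs:
    for jar in glob.glob(f"{d}/*.jar"):
        try:
            with zipfile.ZipFile(jar) as zf:
                if "com/google/common/base/Preconditions.class" in zf.namelist():
                    print(jar)
        except zipfile.BadZipFile:
            continue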

bytesid19 commented 12 hours ago

Was able to solve this, but we are still getting other exceptions.

Getting the exception:

Exception in thread "main" org.apache.hudi.exception.HoodieException: Unable to create org.apache.hudi.storage.hadoop.HoodieHadoopStorage
        at org.apache.hudi.storage.HoodieStorageUtils.getStorage(HoodieStorageUtils.java:44)
        at org.apache.hudi.common.table.HoodieTableMetaClient.getStorage(HoodieTableMetaClient.java:395)
        at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:103)
        at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:909)
        at org.apache.hudi.sync.common.HoodieSyncTool.buildMetaClient(HoodieSyncTool.java:75)
        at org.apache.hudi.hive.HiveSyncTool.lambda$new$0(HiveSyncTool.java:128)
        at org.apache.hudi.common.util.Option.orElseGet(Option.java:153)
        at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:128)
        at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:108)
        at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:547)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.storage.hadoop.HoodieHadoopStorage
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
        at org.apache.hudi.storage.HoodieStorageUtils.getStorage(HoodieStorageUtils.java:41)
        ... 9 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:73)
        ... 10 more
Caused by: org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
        at org.apache.hudi.hadoop.fs.HadoopFSUtils.getFs(HadoopFSUtils.java:128)
        at org.apache.hudi.hadoop.fs.HadoopFSUtils.getFs(HadoopFSUtils.java:119)
        at org.apache.hudi.storage.hadoop.HoodieHadoopStorage.<init>(HoodieHadoopStorage.java:64)
        ... 15 more
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3239)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3259)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
        at org.apache.hudi.hadoop.fs.HadoopFSUtils.getFs(HadoopFSUtils.java:126)
        ... 17 more

Please help here.

danny0405 commented 1 hour ago

You need to add the AWS SDK jar which contains the S3 filesystem.