Open bytesid19 opened 3 days ago
It looks like the clean instant is missing:
try {
HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(cleanerTimeline.getInstantDetails(instant).get());
cleanMetadata.getPartitionMetadata().forEach((partition, partitionMetadata) -> {
if (partitionMetadata.getIsPartitionDeleted()) {
partitionToLatestDeleteTimestamp.put(partition, instant.getTimestamp());
}
});
}
@danny0405 What is the above process doing and how will it help solve our use case ??
Did you check your clean metadata file and see what the partition metadata looks like?
@danny0405 Before that I want to understand what you think might have happened which led you to propose the above solution. Why is hive sync throwing null pointer in this case ?
It would have worked fine without exceptions had we directly migrated hudi version from 0.10.1 to 0.15.0 with hive sync enabled in hudi properties.
But we migrated hudi version from 0.10.1 to 0.15.0 without hive sync initially. And then enabled it in hudi options.
Can you explain what difference would have led to this issue.
Also my partition metadata in .clean file looks like
{
"2023-11-02":{
"partitionPath":"2023-11-02",
"policy":"KEEP_LATEST_COMMITS",
"deletePathPatterns":[
"e3af0439-4282-4505-88f9-0c70b39c8882-0_0-189-1130_20231126175139078.parquet"
],
"successDeleteFiles":[
"e3af0439-4282-4505-88f9-0c70b39c8882-0_0-189-1130_20231126175139078.parquet"
],
"failedDeleteFiles":[
],
"isPartitionDeleted":false
},
"2023-11-03":{
"partitionPath":"2023-11-03",
"policy":"KEEP_LATEST_COMMITS",
"deletePathPatterns":[
"bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-53-1247_20231126174353127.parquet",
"bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-189-1104_20231126172830225.parquet",
"bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-245-1376_20231126173720486.parquet"
],
"successDeleteFiles":[
"bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-53-1247_20231126174353127.parquet",
"bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-189-1104_20231126172830225.parquet",
"bc234ec1-0b3f-4a15-86b6-7df504f80551-0_0-245-1376_20231126173720486.parquet"
],
"failedDeleteFiles":[
],
"isPartitionDeleted":false
}
}
I have no idea why the hive sync could affect the hoodie properties, did you diff the hoodie.properties between the two kind of operations?
Why is hive sync throwing null pointer in this case
Just from the error stacktrace, it looks like related with the clean metadata.
Hudi options while writing to hudi table when we migrated from 0.10.1 to 0.15.0
'hoodie.table.name': 'table_name',
'hoodie.datasource.write.table.name':'table_name',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.recordkey.field': hudi_record_key,
'hoodie.datasource.write.partitionpath.field': "txn_date",
'hoodie.datasource.write.precombine.field': 'ingestion_timestamp',
'hoodie.upsert.shuffle.parallelism': 100,
'hoodie.insert.shuffle.parallelism': 100
hoodie.properties after that migration
#Updated at 2024-09-03T12:04:40.615Z
#Tue Sep 03 12:04:40 UTC 2024
hoodie.table.precombine.field=ingestion_timestamp
hoodie.table.partition.fields=txn_date
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.populate.meta.fields=true
hoodie.timeline.layout.version=1
hoodie.table.version=6
hoodie.table.metadata.partitions=files
hoodie.table.recordkey.fields=record_id
hoodie.table.base.file.format=PARQUET
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.table.name=table_name
hoodie.table.metadata.partitions.inflight=
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.checksum=849148821
extra hudi options added later for hive sync
'hoodie.datasource.meta.sync.enable': 'true',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.database.name': '##db_name##',
'hoodie.datasource.hive_sync.metastore.uris': '##hive_uri##'
after adding above options hoodie.properties never changed.
@danny0405 We are also trying hive sync tool, but that is also failing
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1380)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1361)
at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:119)
at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:108)
at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:547)
when running
./run_sync_tool.sh --sync-mode hms --metastore-uris uri:9083/qbdb --partitioned-by txn_date --base-path s3://abc/bytesid/table --database test --table table
hadoop version : Hadoop 3.3.0 hive version : 3.1.3 Please help here
Exception in thread "main" java.lang.NoSuchMethodError
This is a hadoop jar conflict.
Was able to solve for this but still getting other exceptions Steps
Getting exception
Exception in thread “main” org.apache.hudi.exception.HoodieException: Unable to create org.apache.hudi.storage.hadoop.HoodieHadoopStorage
at org.apache.hudi.storage.HoodieStorageUtils.getStorage(HoodieStorageUtils.java:44)
at org.apache.hudi.common.table.HoodieTableMetaClient.getStorage(HoodieTableMetaClient.java:395)
at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:103)
at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:909)
at org.apache.hudi.sync.common.HoodieSyncTool.buildMetaClient(HoodieSyncTool.java:75)
at org.apache.hudi.hive.HiveSyncTool.lambda$new$0(HiveSyncTool.java:128)
at org.apache.hudi.common.util.Option.orElseGet(Option.java:153)
at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:128)
at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:108)
at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:547)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.storage.hadoop.HoodieHadoopStorage
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
at org.apache.hudi.storage.HoodieStorageUtils.getStorage(HoodieStorageUtils.java:41)
... 9 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:73)
... 10 more
Caused by: org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
at org.apache.hudi.hadoop.fs.HadoopFSUtils.getFs(HadoopFSUtils.java:128)
at org.apache.hudi.hadoop.fs.HadoopFSUtils.getFs(HadoopFSUtils.java:119)
at org.apache.hudi.storage.hadoop.HoodieHadoopStorage.<init>(HoodieHadoopStorage.java:64)
... 15 more
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme “s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3239)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3310)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3278)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hudi.hadoop.fs.HadoopFSUtils.getFs(HadoopFSUtils.java:126)
... 17 more
Please help here.
you need to add the aws sdk jar which contains the s3 filesystem.
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
We recently migrated our hudi version from 0.10.1 to 0.15.0 in our spark jobs. Hudi options at that time
During that time hive meta sync was not enabled.
Now we had the requirement to sync data to hive via hms mode, for which we added following options to above
When we triggered a new spark job after that to write new data to the same hudi table, the job ran successfully and added the new data to hudi table but did not update hudi properties. But now whenever we are triggering the spark job to write new data, job is failing and we are getting null pointer exception at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$null$5(TimelineUtils.java:114) at java.util.HashMap.forEach(HashMap.java:1290) at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getDroppedPartitions$6(TimelineUtils.java:113)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
hoodie.database.name
.Environment Description
Hudi version : 0.15.0
Spark version : 3.3.2
Hive version : 3.1.2
Hadoop version : 3.0.0
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Add any other context about the problem here.
Stacktrace