apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0

Error while creating Iceberg format from Hudi source using S3 bucket as tableBasePath location in config file. #433

Closed buddhayan closed 6 months ago

buddhayan commented 6 months ago

I encountered an issue while attempting to convert Hudi to Iceberg format. When I provide a tableBasePath as a local file path, the conversion works fine. However, when I use tableBasePath as an S3 bucket, I encounter the below error. I'm testing this functionality from my AWS Cloud9 (EC2) instance. Please review the config file and error message provided, and advise if there's something I'm missing.

I followed the documentation (Creating your first interoperable table) to build utilities-0.1.0-SNAPSHOT-bundled.jar and the people Hudi dataset, then executed the command below from the AWS Cloud9 instance terminal: java -jar utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.local.yaml

Config file my_config.yaml

sourceFormat: HUDI
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: s3://bucket-name-eu-west-1/temp/xtable_data/people/
    tableName: people
    partitionSpec: city:VALUE

Error:

~/environment/xtable-poc $ java -jar utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.local.yaml
WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.
2024-05-08 18:17:36 INFO  org.apache.xtable.utilities.RunSync:148 - Running sync for basePath s3://bucket-name-eu-west-1/temp/xtable_data/people/ for following table formats [ICEBERG]
2024-05-08 18:17:36 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:133 - Loading HoodieTableMetaClient from s3://bucket-name-eu-west-1/temp/xtable_data/people
2024-05-08 18:17:36 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-05-08 18:17:37 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-05-08 18:17:37 ERROR org.apache.hadoop.metrics2.impl.MetricsSystemImpl:555 - Error getting localhost name. Using 'localhost'...
java.net.UnknownHostException: ip-**-**-**-***: ip-**-**-**-***: Name or service not known
        at java.net.InetAddress.getLocalHost(InetAddress.java:1670) ~[?:?]
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.getHostname(MetricsSystemImpl.java:553) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configureSystem(MetricsSystemImpl.java:489) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.configure(MetricsSystemImpl.java:485) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.start(MetricsSystemImpl.java:188) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:163) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AInstrumentation.getMetricsSystem(S3AInstrumentation.java:249) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AInstrumentation.registerAsMetricsSource(S3AInstrumentation.java:272) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:229) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:519) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:116) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hudi.common.table.HoodieTableMetaClient.getFs(HoodieTableMetaClient.java:308) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:139) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:692) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:85) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:774) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.hudi.HudiConversionSourceProvider.getConversionSourceInstance(HudiConversionSourceProvider.java:42) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.hudi.HudiConversionSourceProvider.getConversionSourceInstance(HudiConversionSourceProvider.java:31) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.conversion.ConversionController.sync(ConversionController.java:92) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.utilities.RunSync.main(RunSync.java:169) [utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
Caused by: java.net.UnknownHostException: ip-**-**-**-***: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
        at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:930) ~[?:?]
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1543) ~[?:?]
        at java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848) ~[?:?]
        at java.net.InetAddress.getAllByName0(InetAddress.java:1533) ~[?:?]
        at java.net.InetAddress.getLocalHost(InetAddress.java:1665) ~[?:?]
        ... 25 more
2024-05-08 18:17:37 WARN  org.apache.hadoop.fs.s3a.SDKV2Upgrade:39 - Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
2024-05-08 18:17:38 INFO  org.apache.hudi.common.table.HoodieTableConfig:276 - Loading table properties from s3://bucket-name-eu-west-1/temp/xtable_data/people/.hoodie/hoodie.properties
Exception in thread "main" java.lang.NoSuchMethodError: 'java.lang.Object org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(org.apache.hadoop.fs.statistics.DurationTracker, org.apache.hadoop.util.functional.CallableRaisingIOE)'
        at org.apache.hadoop.fs.s3a.Invoker.onceTrackingDuration(Invoker.java:147)
        at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:282)
        at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:435)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
        at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:427)
        at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:545)
        at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
        at java.base/java.io.DataInputStream.read(DataInputStream.java:100)
        at java.base/java.util.Properties$LineReader.readLine(Properties.java:502)
        at java.base/java.util.Properties.load0(Properties.java:418)
        at java.base/java.util.Properties.load(Properties.java:407)
        at org.apache.hudi.common.table.HoodieTableConfig.fetchConfigs(HoodieTableConfig.java:352)
        at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:278)
        at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:141)
        at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:692)
        at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:85)
        at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:774)
        at org.apache.xtable.hudi.HudiConversionSourceProvider.getConversionSourceInstance(HudiConversionSourceProvider.java:42)
        at org.apache.xtable.hudi.HudiConversionSourceProvider.getConversionSourceInstance(HudiConversionSourceProvider.java:31)
        at org.apache.xtable.conversion.ConversionController.sync(ConversionController.java:92)
        at org.apache.xtable.utilities.RunSync.main(RunSync.java:169)
~/environment/xtable-poc $ 
vinishjail97 commented 6 months ago

Then executed below command from AWS Cloud9 instance terminal

Does a basic S3 listing or get work using the AWS CLI?

java.net.UnknownHostException: ip-**-**-**-***: ip-**-**-**-***: Name or service not known

Thanks for reporting the issue @buddhayan. The above error looks more like an environment issue with your AWS instance, where it's not able to resolve the hostname of your EC2 instance. I found a few similar issues by googling, and it looks like some configuration of /etc/hosts is required.

https://stackoverflow.com/questions/35325165/hadoop-java-net-unknownhostexception-hadoop-slave-2 https://stackoverflow.com/questions/6484275/java-net-unknownhostexception-invalid-hostname-for-server-local
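As a quick way to confirm whether this is the problem, the lookup that fails in the log can be reproduced in isolation. This is a generic diagnostic, not part of XTable; it only exercises hostname resolution:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Reproduces only the lookup that fails in the log above: Hadoop's
// MetricsSystemImpl calls InetAddress.getLocalHost(), which throws
// UnknownHostException when the machine's own hostname has no entry
// in /etc/hosts or DNS.
public class HostnameCheck {
    // Returns the resolved address, or null if the hostname does not resolve.
    static String check() {
        try {
            return InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String addr = check();
        if (addr == null) {
            // Typical fix on EC2 (per the Stack Overflow threads above):
            // append "127.0.0.1 <hostname>" to /etc/hosts, using the name
            // printed by the `hostname` command.
            System.out.println("local hostname does not resolve");
        } else {
            System.out.println("local hostname resolves to " + addr);
        }
    }
}
```

Note that in the posted log this resolution failure is only a WARN from the metrics system; the run continues past it and fails later on the NoSuchMethodError.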

buddhayan commented 6 months ago


@vinishjail97 Yes, I can access the S3 bucket and execute AWS CLI commands such as 's3 ls', 'cp', and other operations directly from the terminal where I'm running the JAR file.

The actual IP address was present in the error log; I masked it with ** before posting it here for security reasons.

It's not only EC2: I have also tried executing the jar from an AWS Glue job and get the same error whenever I use an S3 location as tableBasePath.

vinishjail97 commented 6 months ago

Can you share more details about your AWS environment? I'm not able to reproduce the issue in my AWS environment. By the way, is Java 11 being used in your environment to execute the jar?

buddhayan commented 6 months ago

Yes, here's the Java configuration:

~/environment $ java --version
openjdk 11.0.19 2023-04-18 LTS
OpenJDK Runtime Environment Corretto-11.0.19.7.1 (build 11.0.19+7-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.19.7.1 (build 11.0.19+7-LTS, mixed mode)

I don't believe the issue is related to the Java version. When I set tableBasePath to a local path and execute the jar from the same terminal, the sync works properly against the locally available dataset and generates Iceberg metadata: tableBasePath: /home/ec2-user/environment/xtable_data/people/

Error Log:

STD Output:
2024-05-13 09:22:52 INFO org.apache.xtable.utilities.RunSync:148 - Running sync for basePath s3://aws-glue-assets-XXXXX-eu-west-1/temp/xtable_data/people/ for following table formats [ICEBERG]
2024-05-13 09:22:52 INFO org.apache.hudi.common.table.HoodieTableMetaClient:133 - Loading HoodieTableMetaClient from s3://aws-glue-assets-XXXXX-eu-west-1/temp/xtable_data/people
2024-05-13 09:22:53 WARN org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-05-13 09:22:53 WARN org.apache.hadoop.fs.s3a.SDKV2Upgrade:39 - Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
2024-05-13 09:22:54 INFO org.apache.hudi.common.table.HoodieTableConfig:276 - Loading table properties from s3://aws-glue-assets-XXXXX-eu-west-1/temp/xtable_data/people/.hoodie/hoodie.properties

Java process execution failed with return code: 1
An unexpected error occurred: Java process execution failed with return code: 1 and error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(Lorg/apache/hadoop/fs/statistics/DurationTracker;Lorg/apache/hadoop/util/functional/CallableRaisingIOE;)Ljava/lang/Object;
        at org.apache.hadoop.fs.s3a.Invoker.onceTrackingDuration(Invoker.java:147)
        at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:282)
        at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:435)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
        at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:427)
        at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:545)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at java.util.Properties$LineReader.readLine(Properties.java:435)
        at java.util.Properties.load0(Properties.java:353)
        at java.util.Properties.load(Properties.java:341)
        at org.apache.hudi.common.table.HoodieTableConfig.fetchConfigs(HoodieTableConfig.java:352)
        at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:278)
        at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:141)
        at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:692)
        at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:85)
        at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:774)
        at org.apache.xtable.hudi.HudiConversionSourceProvider.getConversionSourceInstance(HudiConversionSourceProvider.java:42)
        at org.apache.xtable.hudi.HudiConversionSourceProvider.getConversionSourceInstance(HudiConversionSourceProvider.java:31)
        at org.apache.xtable.conversion.ConversionController.sync(ConversionController.java:92)
        at org.apache.xtable.utilities.RunSync.main(RunSync.java:169)

From the log it looks like something related to the s3/s3a filesystem. I only hit this issue when using an S3 bucket as the data path.

Can you please have a look and confirm whether I need to pass or set any additional configuration or AWS credentials, besides the config file, when invoking the jar?

vinishjail97 commented 6 months ago

Do you have multiple Hadoop dependencies in your AWS environment by any chance? It's picking up the 2.x version of the Hadoop connector when reading the file from S3.

invokeTrackingDuration is present in the Hadoop 3.x jar, which is what xtable's pom.xml uses: https://github.com/apache/hadoop/blob/branch-3.3.6/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/statistics/impl/IOStatisticsBinding.java#L541

You can look at this question as well. https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics
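One generic way to see which jar a class is actually loaded from at runtime (not an XTable feature; in the failing app you would pass the IOStatisticsBinding class here) is to inspect the class's code source:

```java
import java.security.CodeSource;

// Generic classpath diagnostic: report where a class was loaded from.
// Classes from a shaded/bundled jar show the bundle's path; JDK classes
// have no code source and are reported as "bootstrap".
public class WhichJar {
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return src == null ? "bootstrap" : src.getLocation().toString();
    }

    public static void main(String[] args) {
        // In the failing environment, substitute e.g.
        // org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.class
        // to learn which Hadoop jar won on the classpath.
        System.out.println(locationOf(WhichJar.class));
        System.out.println(locationOf(String.class)); // bootstrap
    }
}
```

If the printed location is anything other than the bundled XTable jar, another Hadoop jar on the classpath is shadowing the 3.x classes.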

lkemmerer commented 6 months ago

Hi! My team and I have also run into the issue described here while attempting to convert Hudi data stored in S3.

Using Java 11 and commit SHA d991e75339f2c564897828bf6d647fcccd986cc5, with a config file similar to the original commenter's (we're converting from Hudi to Delta, but our dataset configuration uses tableBasePath, tableName, and partitionSpec, and we also store our data in S3), we get the following:

➜  incubator-xtable git:(main) java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig ../config.yaml
WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.
2024-05-21 17:22:05 INFO  org.apache.xtable.utilities.RunSync:147 - Running sync for basePath s3://s3-bucket-XXX/x_table_prefix for following table formats [DELTA]
2024-05-21 17:22:05 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:133 - Loading HoodieTableMetaClient from s3://s3-bucket-XXX/x_table_prefix
2024-05-21 17:22:05 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-05-21 17:22:05 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-05-21 17:22:06 WARN  org.apache.hadoop.fs.s3a.SDKV2Upgrade:39 - Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
2024-05-21 17:22:07 INFO  org.apache.hudi.common.table.HoodieTableConfig:276 - Loading table properties from s3://s3-bucket-XXX/x_table_prefix/.hoodie/hoodie.properties
Exception in thread "main" java.lang.NoSuchMethodError: 'java.lang.Object org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(org.apache.hadoop.fs.statistics.DurationTracker, org.apache.hadoop.util.functional.CallableRaisingIOE)'
    at org.apache.hadoop.fs.s3a.Invoker.onceTrackingDuration(Invoker.java:147)
    at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:282)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:435)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:427)
    at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:545)
    at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
    at java.base/java.io.DataInputStream.read(DataInputStream.java:100)
    at java.base/java.util.Properties$LineReader.readLine(Properties.java:502)
    at java.base/java.util.Properties.load0(Properties.java:418)
    at java.base/java.util.Properties.load(Properties.java:407)
    at org.apache.hudi.common.table.HoodieTableConfig.fetchConfigs(HoodieTableConfig.java:352)
    at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:278)
    at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:141)
    at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:692)
    at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:85)
    at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:774)
    at org.apache.xtable.hudi.HudiSourceClientProvider.getSourceClientInstance(HudiSourceClientProvider.java:42)
    at org.apache.xtable.hudi.HudiSourceClientProvider.getSourceClientInstance(HudiSourceClientProvider.java:31)
    at org.apache.xtable.client.OneTableClient.sync(OneTableClient.java:90)
    at org.apache.xtable.utilities.RunSync.main(RunSync.java:168)

Looking at the output of mvn install and at the Maven dependency graph, it appears that Hudi has a dependency that pulls in Hadoop 2.10. I've attempted to cut both outputs down to the pertinent info, but I'll also include the full output as file attachments.

mvn install shaded output install.txt

[INFO] --- shade:3.5.1:shade (default) @ xtable-hudi-support-extensions ---
[INFO] Including org.apache.xtable:xtable-hudi-support-utils:jar:0.1.0-SNAPSHOT in the shaded jar.
[INFO] Including org.apache.hudi:hudi-common:jar:0.14.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-distcp:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-client:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-api:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-common:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-hdfs:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-annotations:jar:3.3.6 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-auth:jar:3.3.6 in the shaded jar.

[INFO] --- shade:3.5.1:shade (default) @ xtable-utilities ---
[INFO] Including org.apache.hadoop:hadoop-distcp:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-client:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-common:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-hdfs:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-client-api:jar:3.3.4 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-client-runtime:jar:3.3.4 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-common:jar:3.3.6 in the shaded jar.
[INFO] Including org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7:jar:1.1.1 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-annotations:jar:3.3.6 in the shaded jar.
[INFO] Including org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-auth:jar:3.3.6 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-server-resourcemanager:jar:3.1.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-api:jar:3.1.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-server-common:jar:3.1.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-registry:jar:3.1.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-server-applicationhistoryservice:jar:3.1.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.1.0 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-aws:jar:3.3.6 in the shaded jar.
[INFO] Including org.apache.hadoop:hadoop-azure:jar:3.3.6 in the shaded jar.

dependencies dependencies.txt

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] xtable                                                             [pom]
[INFO] xtable-api                                                         [jar]
[INFO] xtable-hudi-support                                                [pom]
[INFO] xtable-hudi-support-utils                                          [jar]
[INFO] xtable-core                                                        [jar]
[INFO] xtable-utilities                                                   [jar]
[INFO] xtable-hudi-support-extensions                                     [jar]
[INFO]
[INFO]
[INFO] --------------------< org.apache.xtable:xtable-api >--------------------
[INFO] Building xtable-api 0.1.0-SNAPSHOT                                 [2/7]
[INFO]   from xtable-api/pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- dependency:3.6.1:tree (default-cli) @ xtable-api ---
[INFO] org.apache.xtable:xtable-api:jar:0.1.0-SNAPSHOT
[INFO] +- org.apache.hadoop:hadoop-common:jar:3.3.6:provided
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7:jar:1.1.1:provided
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:3.3.6:provided
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1:provided
[INFO] |  +- org.apache.hadoop:hadoop-auth:jar:3.3.6:provided
[INFO] +- org.apache.hudi:hudi-common:jar:0.14.0:provided
[INFO] |  +- org.apache.hbase:hbase-client:jar:2.4.9:provided
[INFO] |  |  +- org.apache.hadoop:hadoop-auth:jar:2.10.0:provided
[INFO] |  |  +- org.apache.hadoop:hadoop-common:jar:3.3.6:provided
[INFO] |  +- org.apache.hbase:hbase-server:jar:2.4.9:provided
[INFO] |  |  +- org.apache.hadoop:hadoop-distcp:jar:2.10.0:provided
[INFO] |  |  +- org.apache.hadoop:hadoop-annotations:jar:2.10.0:provided
[INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0:provided
[INFO] |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.10.0:provided
[INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-api:jar:2.10.0:provided
[INFO] |  |  |  +- org.apache.hadoop:hadoop-yarn-common:jar:2.10.0:provided
[INFO] |  |  \- org.apache.hadoop:hadoop-hdfs:jar:2.10.0:provided
[INFO] |  |     +- org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0:provided
[INFO]
[INFO] -------------------< org.apache.xtable:xtable-core >--------------------
[INFO] Building xtable-core 0.1.0-SNAPSHOT                                [5/7]
[INFO]   from xtable-core/pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- dependency:3.6.1:tree (default-cli) @ xtable-core ---
[INFO] org.apache.xtable:xtable-core:jar:0.1.0-SNAPSHOT
[INFO] +- org.apache.hudi:hudi-spark3.4-bundle_2.12:jar:0.14.0:test
[INFO] +- org.apache.hudi:hudi-common:jar:0.14.0:compile
[INFO] |  +- org.apache.hbase:hbase-client:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase.thirdparty:hbase-shaded-protobuf:jar:3.5.1:compile
[INFO] |  |  +- org.apache.hbase:hbase-common:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hbase:hbase-logging:jar:2.4.9:compile
[INFO] |  |  |  \- org.apache.hbase.thirdparty:hbase-shaded-gson:jar:3.5.1:compile
[INFO] |  |  +- org.apache.hbase:hbase-hadoop-compat:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-hadoop2-compat:jar:2.4.9:compile
[INFO] |  +- org.apache.hbase:hbase-server:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-distcp:jar:2.10.0:compile
[INFO] |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.10.0:compile
[INFO] |  |  |  |  \- org.apache.hadoop:hadoop-yarn-api:jar:2.10.0:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-yarn-common:jar:2.10.0:compile
[INFO] |  |  \- org.apache.hadoop:hadoop-hdfs:jar:2.10.0:compile
[INFO] |  |     +- org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0:compile
[INFO] +- org.apache.hadoop:hadoop-common:jar:3.3.6:provided
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7:jar:1.1.1:provided
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:3.3.6:compile
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-auth:jar:3.3.6:compile
[INFO] |  +- org.apache.hadoop:hadoop-client-api:jar:3.3.4:provided
[INFO] |  +- org.apache.hadoop:hadoop-client-runtime:jar:3.3.4:provided

[INFO] -----------------< org.apache.xtable:xtable-utilities >-----------------
[INFO] Building xtable-utilities 0.1.0-SNAPSHOT                           [6/7]
[INFO]   from xtable-utilities/pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- dependency:3.6.1:tree (default-cli) @ xtable-utilities ---
[INFO] org.apache.xtable:xtable-utilities:jar:0.1.0-SNAPSHOT
[INFO] +- org.apache.xtable:xtable-core:jar:0.1.0-SNAPSHOT:compile
[INFO] |  +- org.apache.xtable:xtable-hudi-support-utils:jar:0.1.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-common:jar:0.14.0:compile
[INFO] |  |  +- org.apache.hbase:hbase-client:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hbase:hbase-hadoop-compat:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hbase:hbase-hadoop2-compat:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-server:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-distcp:jar:2.10.0:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0:compile
[INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.10.0:compile
[INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-common:jar:2.10.0:compile
[INFO] |  |  |  \- org.apache.hadoop:hadoop-hdfs:jar:2.10.0:compile
[INFO] |  |  |     +- org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0:compile
[INFO] |  +- org.apache.hudi:hudi-java-client:jar:0.14.0:compile
[INFO] |  |  \- org.apache.hudi:hudi-client-common:jar:0.14.0:compile
[INFO] |  |     +- org.apache.hudi:hudi-timeline-service:jar:0.14.0:compile
[INFO] +- org.apache.hadoop:hadoop-common:jar:3.3.6:compile
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7:jar:1.1.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:3.3.6:compile
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-auth:jar:3.3.6:compile
[INFO] +- org.apache.hive:hive-common:jar:3.1.3:compile
[INFO] |  +- org.apache.hive:hive-classification:jar:3.1.3:compile
[INFO] |  +- org.apache.hive:hive-shims:jar:3.1.3:compile
[INFO] |  |  +- org.apache.hive.shims:hive-shims-0.23:jar:3.1.3:runtime
[INFO] |  |  |  \- org.apache.hadoop:hadoop-yarn-server-resourcemanager:jar:3.1.0:runtime
[INFO] |  |  |     +- org.apache.hadoop:hadoop-yarn-api:jar:3.1.0:compile
[INFO] |  |  |     +- org.apache.hadoop:hadoop-yarn-server-common:jar:3.1.0:runtime
[INFO] |  |  |     |  +- org.apache.hadoop:hadoop-yarn-registry:jar:3.1.0:runtime
[INFO] |  |  |     +- org.apache.hadoop:hadoop-yarn-server-applicationhistoryservice:jar:3.1.0:runtime
[INFO] |  |  |     \- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:3.1.0:runtime
[INFO] +- org.apache.hadoop:hadoop-aws:jar:3.3.6:runtime
[INFO]
[INFO] ----------< org.apache.xtable:xtable-hudi-support-extensions >----------
[INFO] Building xtable-hudi-support-extensions 0.1.0-SNAPSHOT             [7/7]
[INFO]   from xtable-hudi-support/xtable-hudi-support-extensions/pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- dependency:3.6.1:tree (default-cli) @ xtable-hudi-support-extensions ---
[INFO] org.apache.xtable:xtable-hudi-support-extensions:jar:0.1.0-SNAPSHOT
[INFO] +- org.apache.xtable:xtable-hudi-support-utils:jar:0.1.0-SNAPSHOT:compile
[INFO] +- org.apache.xtable:xtable-core:jar:0.1.0-SNAPSHOT:compile
[INFO] |  +- org.apache.xtable:xtable-api:jar:0.1.0-SNAPSHOT:compile
[INFO] |  +- org.apache.hudi:hudi-common:jar:0.14.0:compile
[INFO] |  |  +- org.apache.hbase:hbase-client:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hbase:hbase-hadoop-compat:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hbase:hbase-hadoop2-compat:jar:2.4.9:compile
[INFO] |  |  +- org.apache.hbase:hbase-server:jar:2.4.9:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-distcp:jar:2.10.0:compile
[INFO] |  |  |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0:compile
[INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.10.0:compile
[INFO] |  |  |  |  |  \- org.apache.hadoop:hadoop-yarn-api:jar:2.10.0:compile
[INFO] |  |  |  |  +- org.apache.hadoop:hadoop-yarn-common:jar:2.10.0:compile
[INFO] |  |  |  \- org.apache.hadoop:hadoop-hdfs:jar:2.10.0:compile
[INFO] |  |  |     +- org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0:compile
[INFO] +- org.apache.hudi:hudi-client-common:jar:0.14.0:provided
[INFO] |  +- org.apache.hudi:hudi-timeline-service:jar:0.14.0:provided
[INFO] +- org.apache.hudi:hudi-sync-common:jar:0.14.0:provided
[INFO] +- org.apache.hadoop:hadoop-common:jar:3.3.6:provided
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7:jar:1.1.1:provided
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:3.3.6:compile
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-auth:jar:3.3.6:compile
[INFO] +- org.apache.hudi:hudi-spark3.4-bundle_2.12:jar:0.14.0:test
[INFO] +- org.apache.hudi:hudi-java-client:jar:0.14.0:test
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for xtable 0.1.0-SNAPSHOT:
[INFO]
[INFO] xtable ............................................. SUCCESS [  1.167 s]
[INFO] xtable-api ......................................... SUCCESS [  0.270 s]
[INFO] xtable-hudi-support ................................ SUCCESS [  0.002 s]
[INFO] xtable-hudi-support-utils .......................... SUCCESS [  0.294 s]
[INFO] xtable-core ........................................ SUCCESS [  0.391 s]
[INFO] xtable-utilities ................................... SUCCESS [  0.529 s]
[INFO] xtable-hudi-support-extensions ..................... SUCCESS [  0.044 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.968 s
[INFO] Finished at: 2024-05-21T16:24:17-07:00
[INFO] ------------------------------------------------------------------------

I'm not familiar enough with Java to help very much here, but if there's any other information that I can add to this, let me know. Thank you!

the-other-tim-brown commented 6 months ago

@lkemmerer and @buddhayan can you try with this new branch? https://github.com/apache/incubator-xtable/pull/441

From @lkemmerer's post I can see that an older Hadoop version is being pulled in transitively through the hudi-common dependency. I've added exclusions in the branch above.
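For anyone following along, an exclusion of that kind looks roughly like the POM fragment below. This is an illustrative sketch only — the exact coordinates excluded in #441 may differ; the idea is to stop hudi-common's HBase dependencies from dragging in Hadoop 2.10 jars so the 3.3.x jars declared elsewhere win:

```xml
<!-- Sketch (assumed coordinates): exclude transitive Hadoop artifacts
     from hudi-common so only the explicitly declared Hadoop 3.3.x
     jars end up on the classpath. See PR #441 for the real change. -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-common</artifactId>
  <version>0.14.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```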

lkemmerer commented 6 months ago

I no longer see hadoop v2.10 in the install logs or dependency tree (yay!), but I'm getting what I think is the same error. :-/ For completeness's sake, including the git commands, I ran:

$ git pull
$ git checkout 433-hadoop-dependency
$ mvn clean install -DskipTests
$ java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig ../conversion.yaml
2024-05-22 08:19:07 INFO  org.apache.xtable.utilities.RunSync:147 - Running sync for basePath s3://s3-bucket-XXXX/x_table_prefix/ for following table formats [DELTA]
2024-05-22 08:19:07 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:133 - Loading HoodieTableMetaClient from s3://s3-bucket-XXXX/x_table_prefix
2024-05-22 08:19:07 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-05-22 08:19:07 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-05-22 08:19:08 WARN  org.apache.hadoop.fs.s3a.SDKV2Upgrade:39 - Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
2024-05-22 08:19:08 INFO  org.apache.hudi.common.table.HoodieTableConfig:276 - Loading table properties from s3://s3-bucket-XXXX/x_table_prefix/.hoodie/hoodie.properties
Exception in thread "main" java.lang.NoSuchMethodError: 'java.lang.Object org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(org.apache.hadoop.fs.statistics.DurationTracker, org.apache.hadoop.util.functional.CallableRaisingIOE)'
        at org.apache.hadoop.fs.s3a.Invoker.onceTrackingDuration(Invoker.java:147)
        at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:282)
        at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:435)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
        at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
        at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:427)
        at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:545)
        at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
        at java.base/java.io.DataInputStream.read(DataInputStream.java:100)
        at java.base/java.util.Properties$LineReader.readLine(Properties.java:502)
        at java.base/java.util.Properties.load0(Properties.java:418)
        at java.base/java.util.Properties.load(Properties.java:407)
        at org.apache.hudi.common.table.HoodieTableConfig.fetchConfigs(HoodieTableConfig.java:352)
        at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:278)
        at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:141)
        at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:692)
        at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:85)
        at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:774)
        at org.apache.xtable.hudi.HudiSourceClientProvider.getSourceClientInstance(HudiSourceClientProvider.java:42)
        at org.apache.xtable.hudi.HudiSourceClientProvider.getSourceClientInstance(HudiSourceClientProvider.java:31)
        at org.apache.xtable.client.OneTableClient.sync(OneTableClient.java:90)
        at org.apache.xtable.utilities.RunSync.main(RunSync.java:168)

dependencies.txt install.txt

The only thing I see that might be suspicious is the inclusion of org.apache.hbase:hbase-hadoop-compat:jar:2.4.9 as part of hudi-common...
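One way to double-check what survived the exclusions (a sketch, assuming the repo builds with the standard Maven dependency plugin already used to produce the trees above) is to filter the tree down to Hadoop artifacts only; `-Dverbose` also shows versions that Maven's conflict mediation omitted:

```shell
# List every org.apache.hadoop artifact per module, including the
# duplicates/conflicts that the default tree output hides.
mvn dependency:tree -Dincludes=org.apache.hadoop -Dverbose
```

The `NoSuchMethodError` on `IOStatisticsBinding.invokeTrackingDuration` is the classic symptom of mixed Hadoop versions: that method exists in the Hadoop 3.3.x `hadoop-common`, so S3A classes from 3.3.x failing to find it suggests an older `hadoop-common` (or a partial shaded copy) is still shadowing it at runtime.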

the-other-tim-brown commented 6 months ago

Thanks, let me spend some more time on this today to get to the bottom of it.

the-other-tim-brown commented 6 months ago

@lkemmerer I updated the branch with some changes to keep the hadoop-client version consistent. It is working with my AWS account now so give it another try when you get the chance.
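For readers hitting the same problem, keeping the Hadoop client version consistent is typically done with a `dependencyManagement` pin along these lines. This is an illustrative sketch, not necessarily the exact change in the branch; the point is that every `org.apache.hadoop` artifact resolves to one version, so the S3A code and the `IOStatistics` classes it calls come from the same release:

```xml
<!-- Sketch (assumed version/artifact): force a single Hadoop version
     across all modules so transitive dependencies cannot mix 2.x and
     3.3.x classes on the runtime classpath. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>3.3.6</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```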

lkemmerer commented 6 months ago

@the-other-tim-brown That worked! Thank you for the quick fix!