
Configuring AWS S3 with the Hadoop and Hive 4 setup: Iceberg hangs without giving any error #11145

Open AwasthiSomesh opened 2 months ago

AwasthiSomesh commented 2 months ago

Apache Iceberg version

1.6.1 (latest release)

Query engine

Hive

Please describe the bug 🐞

I am trying to configure AWS S3 with the Hadoop and Hive setup.

While doing so, we first saw the following exception when running hadoop fs -ls s3a://somesh.qa.bucket/:

Fatal internal error java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

To resolve this I added hadoop-aws-3.3.6.jar and aws-java-sdk-bundle-1.12.770.jar to the Hadoop classpath, i.e., under /usr/local/hadoop/share/hadoop/common/lib (a quick classpath check is sketched below).
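For anyone reproducing this, one way to confirm the two jars are actually on the effective classpath (a sketch, not part of the original report):

```sh
# List the effective Hadoop classpath, one entry per line, and look for the two jars added above
hadoop classpath | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk-bundle'
```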

I also added the S3-related configuration to core-site.xml, under the /usr/local/hadoop/etc/hadoop directory:

fs.default.name = s3a://somesh.qa.bucket
fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.endpoint = s3.us-west-2.amazonaws.com
fs.s3a.access.key = {Access_Key_Value}
fs.s3a.secret.key = {Secret_Key_Value}
fs.s3a.path.style.access = false
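For readers following along, these flattened entries correspond to Hadoop property elements; a minimal sketch of the core-site.xml fragment being described (values kept as the report's placeholders; fs.defaultFS is the non-deprecated spelling of fs.default.name):

```xml
<configuration>
  <!-- Point the default filesystem at the bucket; fs.default.name is the deprecated alias -->
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://somesh.qa.bucket</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.us-west-2.amazonaws.com</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>{Access_Key_Value}</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>{Secret_Key_Value}</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>false</value>
  </property>
</configuration>
```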

Now when we run hadoop fs -ls s3a://somesh.qa.bucket/, we observe the following:

2024-08-22 13:50:11,294 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2024-08-22 13:50:11,376 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2024-08-22 13:50:11,376 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
2024-08-22 13:50:11,434 WARN util.VersionInfoUtils: The AWS SDK for Java 1.x entered maintenance mode starting July 31, 2024 and will reach end of support on December 31, 2025. For more information, see https://aws.amazon.com/blogs/developer/the-aws-sdk-for-java-1-x-is-in-maintenance-mode-effective-july-31-2024/
You can print where on the file system the AWS SDK for Java 1.x core runtime is located by setting the AWS_JAVA_V1_PRINT_LOCATION environment variable or aws.java.v1.printLocation system property to 'true'.
This message can be disabled by setting the AWS_JAVA_V1_DISABLE_DEPRECATION_ANNOUNCEMENT environment variable or aws.java.v1.disableDeprecationAnnouncement system property to 'true'.
The AWS SDK for Java 1.x is being used here:
at java.lang.Thread.getStackTrace(Thread.java:1564)
at com.amazonaws.util.VersionInfoUtils.printDeprecationAnnouncement(VersionInfoUtils.java:81)
at com.amazonaws.util.VersionInfoUtils.(VersionInfoUtils.java:59)
at com.amazonaws.internal.EC2ResourceFetcher.(EC2ResourceFetcher.java:44)
at com.amazonaws.auth.InstanceMetadataServiceCredentialsFetcher.(InstanceMetadataServiceCredentialsFetcher.java:38)
at com.amazonaws.auth.InstanceProfileCredentialsProvider.(InstanceProfileCredentialsProvider.java:111)
at com.amazonaws.auth.InstanceProfileCredentialsProvider.(InstanceProfileCredentialsProvider.java:91)
at com.amazonaws.auth.InstanceProfileCredentialsProvider.(InstanceProfileCredentialsProvider.java:75)
at com.amazonaws.auth.InstanceProfileCredentialsProvider.(InstanceProfileCredentialsProvider.java:58)
at com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper.initializeProvider(EC2ContainerCredentialsProviderWrapper.java:66)
at com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper.(EC2ContainerCredentialsProviderWrapper.java:55)
at org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider.(IAMInstanceCredentialsProvider.java:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProvider(S3AUtils.java:727)
at org.apache.hadoop.fs.s3a.S3AUtils.buildAWSProviderList(S3AUtils.java:659)
at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:585)
at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:959)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:586)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3611)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3712)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3663)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:557)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:347)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:264)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:247)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:105)
at org.apache.hadoop.fs.shell.Command.run(Command.java:191)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:327)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:390)
ls: s3a://infa.qa.bucket/: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
2024-08-22 13:50:14,248 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
2024-08-22 13:50:14,248 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
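The final ls: line is the actual failure: none of the four listed credential providers found credentials. As an illustration (not a confirmed fix for this report), the message itself names the environment variables that S3A's EnvironmentVariableCredentialsProvider checks:

```sh
# Hypothetical values; these variable names come straight from the error message
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
hadoop fs -ls s3a://somesh.qa.bucket/
```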

Could you please help us resolve this issue as soon as possible?

Willingness to contribute

pvary commented 2 months ago

If this is a Hive 4 issue, could you please talk to the Hive team, as the Hive 4 integration is owned by them? Thanks, Peter

AwasthiSomesh commented 2 months ago

@pvary We are facing this issue with Iceberg on Hive; we are not sure which team can help better with this.

Please suggest if you know anything; we have also raised this with the Hive team.

pvary commented 2 months ago

@AwasthiSomesh: The issue title suggests that this problem happens with Hive 4. That is why I suggested that the Apache Hive team could help you better; the Hive 4 integration is maintained by them. It is entirely possible that they could point out some issues with the Iceberg code, but there is some very specific Hive code that runs before the Iceberg APIs are called.

AwasthiSomesh commented 2 months ago

@pvary Thanks for your update.

AwasthiSomesh commented 2 months ago

It looks like Hive issue discussion is not available through GitHub. Does anyone know how to reach the Hive 4 team via GitHub?

pvary commented 2 months ago

You should create a Jira (https://issues.apache.org/jira/projects/HIVE/issues/HIVE-25351?filter=allopenissues) or use the dev/user mailing lists to communicate. See the GitHub README: https://github.com/apache/hive

AwasthiSomesh commented 2 months ago

@pvary Thanks a lot for your quick response.

I have two questions below; could you please help with your comments?

Q1. As mentioned in the official Iceberg documentation, Hive 4 is supported by Iceberg without any extra dependency: https://iceberg.apache.org/docs/latest/hive/#feature-support

Is this supported only with HDFS storage, or can we use it with S3/ADLS Gen2 as well?

Q2. If Hive 4 is not supported with other external storage like S3/ADLS Gen2, what is the alternative? Do we have any other option, such as Hive 3/2/1 with additional dependencies, to use Iceberg with the Hive catalog on S3/ADLS Gen2 storage?

Could you please help here?

Thanks, Somesh

AwasthiSomesh commented 2 months ago

@pvary If Iceberg supports ADLS Gen2, what configuration is required to use it seamlessly?

AwasthiSomesh commented 2 months ago

@pvary / all: can anyone help here?

pvary commented 2 months ago

@AwasthiSomesh: This should help: https://iceberg.apache.org/docs/nightly/kafka-connect/?h=adls#azure-adls-configuration-example
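Paraphrasing that page, the ADLS side of the setup boils down to catalog properties of roughly this shape (a sketch based on the linked Kafka Connect example; container and account values are placeholders, and the adls.auth.shared-key.* keys come from Iceberg's Azure module):

```properties
# Sketch per the linked docs, not a verbatim copy
iceberg.catalog.io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
iceberg.catalog.warehouse=abfss://<container>@<account>.dfs.core.windows.net/warehouse
iceberg.catalog.adls.auth.shared-key.account.name=<storage-account-name>
iceberg.catalog.adls.auth.shared-key.account.key=<storage-account-key>
```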

AwasthiSomesh commented 2 months ago

@pvary I am able to create an Iceberg table using the Hive 4 setup and insert data as well, but when we try to read it back, the result is empty.


Yet when we look at the S3 location, all the data files have been created there.

Could you please let me know if there is anything else we need to do?
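For context, the sequence being described has roughly this shape (table and column names are invented for illustration; STORED BY ICEBERG is the Hive 4 DDL syntax):

```sql
-- Hypothetical table; illustrates the create/insert/read sequence described above
CREATE TABLE ice_demo (id INT, name STRING) STORED BY ICEBERG;
INSERT INTO ice_demo VALUES (1, 'a'), (2, 'b');
SELECT * FROM ice_demo;  -- reportedly returns no rows, though data files exist in S3
```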

AwasthiSomesh commented 2 months ago

@pvary We set hive.execution.engine=mr for the insert; otherwise the insert was not working with the Tez engine.

But with MR we are not able to read a single table with Hive 4.0.0-alpha-2.

With Tez we are facing the below error while inserting records:

Error:

.6.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$exists$34(S3AFileSystem.java:4636) ~[hadoop-aws-3.3.6.jar:?]
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547) ~[hadoop-common-3.3.6.jar:?]
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528) ~[hadoop-common-3.3.6.jar:?]
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449) ~[hadoop-common-3.3.6.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2480) ~[hadoop-aws-3.3.6.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2499) ~[hadoop-aws-3.3.6.jar:?]
at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4634) ~[hadoop-aws-3.3.6.jar:?]
at org.apache.tez.common.TezCommonUtils.getTezBaseStagingPath(TezCommonUtils.java:91) ~[tez-api-0.10.3.jar:0.10.3]
at org.apache.tez.common.TezCommonUtils.getTezSystemStagingPath(TezCommonUtils.java:149) ~[tez-api-0.10.3.jar:0.10.3]
at org.apache.tez.dag.app.DAGAppMaster.serviceInit(DAGAppMaster.java:492) ~[tez-dag-0.10.3.jar:0.10.3]
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) ~[hadoop-common-3.3.6.jar:?]
at org.apache.tez.dag.app.DAGAppMaster$9.run(DAGAppMaster.java:2644) ~[tez-dag-0.10.3.jar:0.10.3]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_342]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_342]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) ~[hadoop-common-3.3.6.jar:?]
at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:2641) ~[tez-dag-0.10.3.jar:0.10.3]
at org.apache.tez.client.LocalClient$1.run(LocalClient.java:361) ~[tez-dag-0.10.3.jar:0.10.3]
... 1 more
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. java.io.IOException: org.apache.tez.dag.api.TezUncheckedException: java.nio.file.AccessDeniedException: s3a://com.anush/opt/hive/scratch_dir/hive/_tez_session_dir/0c1896fa-2b9d-4461-9ab4-ced0fd46ef48: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
INFO : Completed executing command(queryId=hive_20240919065346_a71fd349-e14c-4bfa-9fb7-0b1b396565e3); Time taken: 44.607 seconds
Error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. java.io.IOException: org.apache.tez.dag.api.TezUncheckedException: java.nio.file.AccessDeniedException: s3a://com.anush/opt/hive/scratch_dir/hive/_tez_session_dir/0c1896fa-2b9d-4461-9ab4-ced0fd46ef48: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)) (state=08S01,code=1)
0: jdbc:hive2://localhost:10000/>
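The root of this Tez failure is again the credentials chain: the Tez application master cannot authenticate while probing its session directory on s3a://. Two commonly suggested angles (assumptions on my part, not confirmed fixes for this report) are making the fs.s3a.access.key/fs.s3a.secret.key settings visible cluster-wide in core-site.xml rather than only in the client session, or keeping Hive's scratch directory off S3 entirely, e.g. in hive-site.xml:

```xml
<!-- Sketch, not a confirmed fix: keep the scratch dir on HDFS so the Tez AM
     does not need S3 credentials just to create its session directory -->
<property>
  <name>hive.exec.scratchdir</name>
  <value>hdfs:///tmp/hive</value>
</property>
```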

AwasthiSomesh commented 2 months ago

Hi all, can anyone please help here?

BsoBird commented 1 month ago

@AwasthiSomesh Hello, can you try applying a patch from me and trying again?

AwasthiSomesh commented 1 month ago

@pvary Please share it; I will do it.

BsoBird commented 1 month ago

@AwasthiSomesh Check your email.

steveloughran commented 1 month ago

ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. java.io.IOException: org.apache.tez.dag.api.TezUncheckedException: java.nio.file.AccessDeniedException: s3a://com.anush/opt/hive/scratch_dir/hive/_tez_session_dir/0c1896fa-2b9d-4461-9ab4-ced0fd46ef48:

You don't have write permission to that path.

Tez should handle it better.

If your bucket really is called "com.anush": no, the S3A filesystem doesn't support that. Amazon says bucket names with dots are "exclusively for web sites", and with good reason.

Also, that AWS deprecation warning message flags that you are using a later version of the AWS SDK than any Hadoop release. Your choice, but bear in mind it hasn't been qualified, and those SDKs can be fussy at times.
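As a quick way to act on the bucket-name point above (a sketch with a hypothetical, dot-free bucket name):

```sh
# Dot-free bucket names avoid the S3A restriction noted above
hadoop fs -mkdir -p s3a://anush-hive/scratch_dir
hadoop fs -ls s3a://anush-hive/
```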