apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

access failed from host to iceberg container #9465

Open vagetablechicken opened 9 months ago

vagetablechicken commented 9 months ago

hi, I'm doing some tests based on the Spark quickstart https://iceberg.apache.org/spark-quickstart/. I started it up and tried to connect to Iceberg from outside the containers (host -> iceberg in containers). It works when I use pyiceberg on the host: I can read and write. But it fails when I use spark-sql on the host (3.5.0, the same version as the spark-iceberg image). The steps run on the host are in the snippet. Any help/suggestion would be appreciated.

Slack Message

spark-sql ()> SELECT * FROM demo.nyc.taxis;
null (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8) (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0)
software.amazon.awssdk.services.s3.model.NoSuchKeyException: null (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8) (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0)
    at software.amazon.awssdk.services.s3.model.NoSuchKeyException$BuilderImpl.build(NoSuchKeyException.java:126)
    at software.amazon.awssdk.services.s3.model.NoSuchKeyException$BuilderImpl.build(NoSuchKeyException.java:80)
    at software.amazon.awssdk.services.s3.internal.handlers.ExceptionTranslationInterceptor.modifyException(ExceptionTranslationInterceptor.java:63)
    at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.modifyException(ExecutionInterceptorChain.java:202)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.runModifyException(ExceptionReportingUtils.java:54)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.reportFailureToInterceptors(ExceptionReportingUtils.java:38)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:39)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
    at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:196)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:171)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:179)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
    at software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:5445)
    at org.apache.iceberg.aws.s3.BaseS3File.getObjectMetadata(BaseS3File.java:85)
    at org.apache.iceberg.aws.s3.S3InputFile.getLength(S3InputFile.java:77)
    at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:100)
    at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:76)
    at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:36)
    at org.apache.iceberg.relocated.com.google.common.collect.Iterables.addAll(Iterables.java:333)
    at org.apache.iceberg.relocated.com.google.common.collect.Lists.newLinkedList(Lists.java:241)
    at org.apache.iceberg.ManifestLists.read(ManifestLists.java:45)
    at org.apache.iceberg.BaseSnapshot.cacheManifests(BaseSnapshot.java:146)
    at org.apache.iceberg.BaseSnapshot.deleteManifests(BaseSnapshot.java:180)
    at org.apache.iceberg.BaseDistributedDataScan.findMatchingDeleteManifests(BaseDistributedDataScan.java:207)
    at org.apache.iceberg.BaseDistributedDataScan.doPlanFiles(BaseDistributedDataScan.java:148)
    at org.apache.iceberg.SnapshotScan.planFiles(SnapshotScan.java:139)
    at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.tasks(SparkPartitioningAwareScan.java:174)
    at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.taskGroups(SparkPartitioningAwareScan.java:202)
    at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.outputPartitioning(SparkPartitioningAwareScan.java:104)
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:44)
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:42)
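For context, a host-side spark-sql session talking to the quickstart containers needs the Iceberg catalog properties that the spark-iceberg image ships in its spark-defaults.conf. The sketch below mirrors that file; the catalog name `demo` and the `localhost` endpoints are assumptions about how the compose file publishes the REST catalog (8181) and MinIO (9000) ports, and may differ in other setups:

```properties
# Hypothetical host-side spark-defaults.conf, mirroring the quickstart image.
# AWS credentials (admin/password for the bundled MinIO) would still need to be
# supplied, e.g. via the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars.
spark.sql.extensions                       org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo                     org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type                rest
spark.sql.catalog.demo.uri                 http://localhost:8181
spark.sql.catalog.demo.io-impl             org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.demo.s3.endpoint         http://localhost:9000
spark.sql.catalog.demo.s3.path-style-access true
spark.sql.defaultCatalog                   demo
```

Note that `s3.endpoint` must be reachable from the host; inside the compose network the same property typically points at the MinIO container's hostname instead.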


P.S. pyiceberg code, which succeeds on the host:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)

print(catalog.list_namespaces())
table = catalog.load_table("nyc.taxis")
scan = table.scan()
print(scan.to_pandas())

nastra commented 9 months ago

@vagetablechicken can you check whether the nyc schema actually exists before querying?

vagetablechicken commented 9 months ago

> @vagetablechicken can you check whether the nyc schema actually exists before querying?

@nastra Thanks for the help. Yes, the table exists; it fails on the S3 read.

echo "desc demo.nyc.taxis;" | bin/spark-sql
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Spark Web UI available at http://node-4.sg.4pd.io:4040
Spark master: local[*], Application Id: local-1705373054068
spark-sql ()> desc demo.nyc.taxis;
vendor_id               bigint
trip_id                 bigint
trip_distance           float
fare_amount             double
store_and_fwd_flag      string
# Partition Information
# col_name              data_type               comment
vendor_id               bigint
Time taken: 1.431 seconds, Fetched 8 row(s)
spark-sql ()>
nastra commented 9 months ago

I think you might be missing the warehouse configuration. Here's how it's set for the quickstart example: https://github.com/tabular-io/docker-spark-iceberg/blob/main/spark/spark-defaults.conf#L27. Can you try setting that?
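For readers following along, the linked line configures the catalog's warehouse location. On the host it would look roughly like the fragment below; the catalog name `demo` and the `s3://warehouse/` value are assumptions based on the quickstart's bucket naming, so check the linked file for the exact value:

```properties
# Hypothetical warehouse setting, in the style of the quickstart's spark-defaults.conf
spark.sql.catalog.demo.warehouse    s3://warehouse/
```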

vagetablechicken commented 9 months ago

> I think you might be missing the warehouse configuration. Here's how it's set for the quickstart example: https://github.com/tabular-io/docker-spark-iceberg/blob/main/spark/spark-defaults.conf#L27. Can you try setting that?

But pyiceberg doesn't need this, so I don't think that's the cause. Anyway, I tried adding the warehouse setting and got the same error: NoSuchKeyException: null (Service: S3, Status Code: 404)
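One way to narrow down a 404 like this: the stack trace shows the failure in `DefaultS3Client.headObject`, i.e. a HEAD request on a specific metadata/manifest object. Splitting the object's `s3://` path into bucket and key shows exactly what Spark asked the endpoint for, which can then be checked directly against MinIO. A minimal stdlib sketch of that split; the example URI is hypothetical, not taken from the reporter's setup:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical manifest-list path of the kind the trace is reading:
bucket, key = split_s3_uri("s3://warehouse/nyc/taxis/metadata/snap-123.avro")
print(bucket, key)  # warehouse nyc/taxis/metadata/snap-123.avro
```

If the key exists in the bucket but Spark still gets a 404, the request is likely going to the wrong endpoint (or using virtual-hosted addressing against MinIO, where path-style access is usually needed).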

github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale'; commenting on the issue is preferred when possible.