apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

access failed from host to iceberg container #9465

Open vagetablechicken opened 9 months ago

vagetablechicken commented 9 months ago

hi, I'm doing some tests based on the Spark quickstart https://iceberg.apache.org/spark-quickstart/. I started it up and tried to connect to Iceberg from outside the containers (host -> iceberg in containers). It works when I use pyiceberg on the host: I can read and write. But it fails when I use spark-sql on the host (3.5.0, the same version as the spark-iceberg image). The steps run on the host are in the snippet. Any help/suggestion would be appreciated.

Slack Message

spark-sql ()> SELECT * FROM demo.nyc.taxis;
null (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8) (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0)
software.amazon.awssdk.services.s3.model.NoSuchKeyException: null (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8) (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0)
    at software.amazon.awssdk.services.s3.model.NoSuchKeyException$BuilderImpl.build(NoSuchKeyException.java:126)
    at software.amazon.awssdk.services.s3.model.NoSuchKeyException$BuilderImpl.build(NoSuchKeyException.java:80)
    at software.amazon.awssdk.services.s3.internal.handlers.ExceptionTranslationInterceptor.modifyException(ExceptionTranslationInterceptor.java:63)
    at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.modifyException(ExecutionInterceptorChain.java:202)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.runModifyException(ExceptionReportingUtils.java:54)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.reportFailureToInterceptors(ExceptionReportingUtils.java:38)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:39)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
    at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:196)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:171)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:179)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
    at software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:5445)
    at org.apache.iceberg.aws.s3.BaseS3File.getObjectMetadata(BaseS3File.java:85)
    at org.apache.iceberg.aws.s3.S3InputFile.getLength(S3InputFile.java:77)
    at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:100)
    at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:76)
    at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:36)
    at org.apache.iceberg.relocated.com.google.common.collect.Iterables.addAll(Iterables.java:333)
    at org.apache.iceberg.relocated.com.google.common.collect.Lists.newLinkedList(Lists.java:241)
    at org.apache.iceberg.ManifestLists.read(ManifestLists.java:45)
    at org.apache.iceberg.BaseSnapshot.cacheManifests(BaseSnapshot.java:146)
    at org.apache.iceberg.BaseSnapshot.deleteManifests(BaseSnapshot.java:180)
    at org.apache.iceberg.BaseDistributedDataScan.findMatchingDeleteManifests(BaseDistributedDataScan.java:207)
    at org.apache.iceberg.BaseDistributedDataScan.doPlanFiles(BaseDistributedDataScan.java:148)
    at org.apache.iceberg.SnapshotScan.planFiles(SnapshotScan.java:139)
    at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.tasks(SparkPartitioningAwareScan.java:174)
    at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.taskGroups(SparkPartitioningAwareScan.java:202)
    at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.outputPartitioning(SparkPartitioningAwareScan.java:104)
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:44)
    at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:42)
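For context, a host-side spark-sql session talking to the quickstart containers needs the Iceberg catalog properties that the spark-iceberg image ships in its spark-defaults.conf. The sketch below mirrors that file; the catalog name `demo` and the `localhost` endpoints are assumptions about how the compose file publishes the REST catalog (8181) and MinIO (9000) ports, and may differ in other setups:

```properties
# Hypothetical host-side spark-defaults.conf, mirroring the quickstart image.
# AWS credentials (admin/password for the bundled MinIO) would still need to be
# supplied, e.g. via the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars.
spark.sql.extensions                       org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo                     org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type                rest
spark.sql.catalog.demo.uri                 http://localhost:8181
spark.sql.catalog.demo.io-impl             org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.demo.s3.endpoint         http://localhost:9000
spark.sql.catalog.demo.s3.path-style-access true
spark.sql.defaultCatalog                   demo
```

Note that `s3.endpoint` must be reachable from the host; inside the compose network the same property typically points at the MinIO container's hostname instead.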


P.S. pyiceberg code, which succeeds on the host:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)

print(catalog.list_namespaces())
table = catalog.load_table("nyc.taxis")
scan = table.scan()
print(scan.to_pandas())

nastra commented 9 months ago

@vagetablechicken can you check whether the nyc schema actually exists before querying?

vagetablechicken commented 9 months ago

> @vagetablechicken can you check whether the nyc schema actually exists before querying?

@nastra Thanks for the help. Yes, the table exists; it fails on the S3 read.

echo "desc demo.nyc.taxis;" | bin/spark-sql
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Spark Web UI available at http://node-4.sg.4pd.io:4040
Spark master: local[*], Application Id: local-1705373054068
spark-sql ()> desc demo.nyc.taxis;
vendor_id               bigint
trip_id                 bigint
trip_distance           float
fare_amount             double
store_and_fwd_flag      string
# Partition Information
# col_name              data_type               comment
vendor_id               bigint
Time taken: 1.431 seconds, Fetched 8 row(s)
spark-sql ()>
nastra commented 9 months ago

I think you might be missing the warehouse configuration. Here's how it's set for the quickstart example: https://github.com/tabular-io/docker-spark-iceberg/blob/main/spark/spark-defaults.conf#L27. Can you try setting that?
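For readers following along, the linked line configures the catalog's warehouse location. On the host it would look roughly like the fragment below; the catalog name `demo` and the `s3://warehouse/` value are assumptions based on the quickstart's bucket naming, so check the linked file for the exact value:

```properties
# Hypothetical warehouse setting, in the style of the quickstart's spark-defaults.conf
spark.sql.catalog.demo.warehouse    s3://warehouse/
```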

vagetablechicken commented 9 months ago

> I think you might be missing the warehouse configuration. Here's how it's set for the quickstart example: https://github.com/tabular-io/docker-spark-iceberg/blob/main/spark/spark-defaults.conf#L27. Can you try setting that?

But pyiceberg doesn't need this, so I don't think that's the cause. Anyway, I tried adding the warehouse setting and got the same error: NoSuchKeyException: null (Service: S3, Status Code: 404)
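One way to narrow down a 404 like this: the stack trace shows the failure in `DefaultS3Client.headObject`, i.e. a HEAD request on a specific metadata/manifest object. Splitting the object's `s3://` path into bucket and key shows exactly what Spark asked the endpoint for, which can then be checked directly against MinIO. A minimal stdlib sketch of that split; the example URI is hypothetical, not taken from the reporter's setup:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical manifest-list path of the kind the trace is reading:
bucket, key = split_s3_uri("s3://warehouse/nyc/taxis/metadata/snap-123.avro")
print(bucket, key)  # warehouse nyc/taxis/metadata/snap-123.avro
```

If the key exists in the bucket but Spark still gets a 404, the request is likely going to the wrong endpoint (or using virtual-hosted addressing against MinIO, where path-style access is usually needed).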

github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale'; commenting on the issue is preferred when possible.