Open vagetablechicken opened 9 months ago
@vagetablechicken can you check whether the `nyc` schema actually exists before querying?
@nastra Thanks for the help. Yes, the table exists; it fails on the S3 read.
```
echo "desc demo.nyc.taxis;" | bin/spark-sql
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Spark Web UI available at http://node-4.sg.4pd.io:4040
Spark master: local[*], Application Id: local-1705373054068
spark-sql ()> desc demo.nyc.taxis;
vendor_id               bigint
trip_id                 bigint
trip_distance           float
fare_amount             double
store_and_fwd_flag      string
# Partition Information
# col_name              data_type       comment
vendor_id               bigint
Time taken: 1.431 seconds, Fetched 8 row(s)
spark-sql ()>
```
I think you might be missing the `warehouse` configuration. Here's how it's set for the quickstart example: https://github.com/tabular-io/docker-spark-iceberg/blob/main/spark/spark-defaults.conf#L27. Can you try setting that?
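For reference, the catalog settings in that quickstart `spark-defaults.conf` look roughly like the fragment below. This is reproduced from memory of the quickstart repo, so treat the exact values as assumptions; note that `rest` and `minio` are docker-compose service names, which a client running *outside* the containers would need to replace with `localhost` (or whatever hostname resolves to those services):

```
spark.sql.extensions                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type          rest
spark.sql.catalog.demo.uri           http://rest:8181
spark.sql.catalog.demo.io-impl      org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.demo.warehouse     s3://warehouse/wh/
spark.sql.catalog.demo.s3.endpoint   http://minio:9000
spark.sql.defaultCatalog             demo
```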
But pyiceberg doesn't need this setting, so I don't think it's the cause. Anyway, I tried adding `warehouse` and got the same error: `NoSuchKeyException: null (Service: S3, Status Code: 404)`.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
hi, I'm doing some tests based on the Spark quickstart (https://iceberg.apache.org/spark-quickstart/). I start it up and try to connect to Iceberg from outside the containers (host -> Iceberg in containers). It works when I use pyiceberg on the host: I can read and write. But it fails when I use spark-sql on the host (3.5.0, the same version as the spark-iceberg image). The steps I run on the host are in the snippet. Any help/suggestions would be appreciated.
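For a host-side spark-sql session, the same catalog settings can be passed as `--conf` flags. The sketch below is untested and the package coordinates, versions, and MinIO credentials are assumptions taken from the quickstart's defaults, not verified values:

```shell
# MinIO credentials from the quickstart compose file (assumption)
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
export AWS_REGION=us-east-1

bin/spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3,org.apache.iceberg:iceberg-aws-bundle:1.4.3 \
  --conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.demo.type=rest \
  --conf spark.sql.catalog.demo.uri=http://localhost:8181 \
  --conf spark.sql.catalog.demo.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.demo.warehouse=s3://warehouse/wh/ \
  --conf spark.sql.catalog.demo.s3.endpoint=http://localhost:9000 \
  --conf spark.sql.catalog.demo.s3.path-style-access=true
```

The key difference from the in-container config is that every service hostname must be reachable from the host, so `rest:8181` and `minio:9000` become `localhost` with the published ports; `s3.path-style-access=true` is commonly needed for MinIO since it does not serve virtual-hosted-style bucket URLs by default.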
Slack Message
```
spark-sql ()> SELECT * FROM demo.nyc.taxis;
null (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8) (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0)
software.amazon.awssdk.services.s3.model.NoSuchKeyException: null (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8) (Service: S3, Status Code: 404, Request ID: 17AAB3E7394C47F0)
	at software.amazon.awssdk.services.s3.model.NoSuchKeyException$BuilderImpl.build(NoSuchKeyException.java:126)
	at software.amazon.awssdk.services.s3.model.NoSuchKeyException$BuilderImpl.build(NoSuchKeyException.java:80)
	at software.amazon.awssdk.services.s3.internal.handlers.ExceptionTranslationInterceptor.modifyException(ExceptionTranslationInterceptor.java:63)
	at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.modifyException(ExecutionInterceptorChain.java:202)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.runModifyException(ExceptionReportingUtils.java:54)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.reportFailureToInterceptors(ExceptionReportingUtils.java:38)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:39)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
	at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:196)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:171)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:179)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
	at software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:5445)
	at org.apache.iceberg.aws.s3.BaseS3File.getObjectMetadata(BaseS3File.java:85)
	at org.apache.iceberg.aws.s3.S3InputFile.getLength(S3InputFile.java:77)
	at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:100)
	at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:76)
	at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:36)
	at org.apache.iceberg.relocated.com.google.common.collect.Iterables.addAll(Iterables.java:333)
	at org.apache.iceberg.relocated.com.google.common.collect.Lists.newLinkedList(Lists.java:241)
	at org.apache.iceberg.ManifestLists.read(ManifestLists.java:45)
	at org.apache.iceberg.BaseSnapshot.cacheManifests(BaseSnapshot.java:146)
	at org.apache.iceberg.BaseSnapshot.deleteManifests(BaseSnapshot.java:180)
	at org.apache.iceberg.BaseDistributedDataScan.findMatchingDeleteManifests(BaseDistributedDataScan.java:207)
	at org.apache.iceberg.BaseDistributedDataScan.doPlanFiles(BaseDistributedDataScan.java:148)
	at org.apache.iceberg.SnapshotScan.planFiles(SnapshotScan.java:139)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.tasks(SparkPartitioningAwareScan.java:174)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.taskGroups(SparkPartitioningAwareScan.java:202)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.outputPartitioning(SparkPartitioningAwareScan.java:104)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:44)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:42)
```
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
    },
)
print(catalog.list_namespaces())
table = catalog.load_table("nyc.taxis")
scan = table.scan()
print(scan.to_pandas())
```