Closed: ndrluis closed this issue 3 weeks ago.
I performed another test using the Tabular catalog, attempting to scan the sandbox warehouse in the examples namespace, specifically targeting the nyc_taxi_yellow table, but it returned no results.
I found the problem. I don’t know how to solve it, but I will try.
The `while let Some(Ok(task)) = tasks.next().await` statement is hiding some errors. In my previous attempt, I was trying to run it without the S3 credentials and was not receiving the access denied error. This happens because `tasks.next()` returns the error but does not expose it to the user.
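The failure mode is easy to reproduce with a plain iterator. Below is a minimal std-only sketch (not the actual iceberg-rust code; the function names are mine) showing how `while let Some(Ok(..))` silently stops at the first `Err` without reporting it, and one way to propagate the error instead:

```rust
// Collects values the way `while let Some(Ok(..))` does: the loop ends
// silently at the first Err, which is never reported to the caller.
fn collect_silently(results: &[Result<i32, String>]) -> Vec<i32> {
    let mut seen = Vec::new();
    let mut iter = results.iter();
    while let Some(Ok(v)) = iter.next() {
        seen.push(*v);
    }
    seen
}

// Fixed: match on each item and return the first error to the caller.
fn collect_or_fail(results: &[Result<i32, String>]) -> Result<Vec<i32>, String> {
    let mut seen = Vec::new();
    for item in results {
        match item {
            Ok(v) => seen.push(*v),
            Err(e) => return Err(e.clone()),
        }
    }
    Ok(seen)
}

fn main() {
    // Simulated task stream: the second item fails, e.g. with an S3 access-denied error.
    let results = vec![Ok(1), Err("access denied".to_string()), Ok(3)];

    // The buggy loop looks like a short but successful scan.
    assert_eq!(collect_silently(&results), vec![1]);

    // The fixed loop surfaces the access-denied error.
    assert_eq!(collect_or_fail(&results), Err("access denied".to_string()));
}
```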
While testing with Tabular, I'm receiving a 403 error from S3. So, we have two issues to solve.
One is to expose the reading errors to the user, and the other is to understand why we are getting these access denied errors.
For the Tabular example, I encountered an 'access denied' problem: the FileIO does not work with remote signing. For the MinIO example, the problem was solved when I added a match statement to return the error from `tasks.next()`.
To scan with remote signing, we need to implement this.
Hi, does remote signing mean presign in S3?
I'm guessing https://github.com/apache/iceberg-rust/pull/498 should close this issue. Would you like to verify it?
@Xuanwo
> Hi, does remote signing mean presign in S3?
Yes and no. I'm not sure if this is the flow, because I haven't found any documentation; this is based on my understanding from reading the Python implementation.
It's a presign process, but it's not the client's responsibility to presign. The get config will return the s3.signer.uri, and the load table will return s3.remote-signing-enabled as true along with some other S3 configurations. With that, we need to "presign" using the token returned in the load table.
This is the specification for the server responsible for the signing: s3-signer-open-api.yaml
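From my reading of s3-signer-open-api.yaml, the exchange looks roughly like the sketch below. This is a hand-written illustration, not the actual client; the endpoint path, field names, and all values are my assumptions from the spec: the client POSTs the request it wants signed to the signer service (authenticated with the token from loadTable) and uses the signed headers from the response instead of signing locally.

```rust
// Builds the JSON body for the signing request, per my reading of the
// S3SignRequest schema (field names are assumptions from the spec).
fn sign_request_body(method: &str, region: &str, uri: &str) -> String {
    format!(
        r#"{{"method":"{method}","region":"{region}","uri":"{uri}","headers":{{}}}}"#
    )
}

fn main() {
    // s3.signer.uri comes from the getConfig response (hypothetical value here);
    // the token returned by loadTable authenticates the POST.
    let signer_uri = "https://rest-catalog.example.com";
    let body = sign_request_body("GET", "us-east-1", "s3://warehouse/data/00000.parquet");
    println!("POST {signer_uri}/v1/aws/s3/sign");
    println!("{body}");
}
```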
> I'm guessing https://github.com/apache/iceberg-rust/pull/498 should close this issue. Would you like to verify it?
I'm not comfortable closing this issue without a regression test that guarantees the expected behavior.
> I'm not comfortable closing this issue without a regression test that guarantees the expected behavior.
+1 on this. Currently we don't have regression tests on the whole reading process, which involves integrating with external systems such as Spark.
> It's a presign process, but it's not the client's responsibility to presign. The get config will return the s3.signer.uri, and the load table will return s3.remote-signing-enabled as true along with some other S3 configurations. With that, we need to "presign" using the token returned in the load table.
Got it. So, we need to support presign in the REST catalog. Could you help by creating an issue for this? I'll review this section and draft a plan for its implementation.
> Currently we don't have regression tests on the whole reading process, which involves integrating with external systems such as Spark.
I think we can start with very basic tests, like just scanning the whole table.
> I think we can start with very basic tests, like just scanning the whole table.
The reason I didn't start this yet is that I want to do it after the DataFusion integration. @ZENOTME and I did integration tests in icelake before, and I have to say that without SQL engine support, it's painful to maintain those tests.
> @ZENOTME and I did integration tests in icelake before, and I have to say that without SQL engine support, it's painful to maintain those tests.
I agree that we need a SQL engine to make testing easier.
However, maintaining basic unit tests based on `fs` or `memory` should be straightforward, right? We don't need separate test modules; just implement them as unit tests in the REST catalog. For example, it could be as simple as...
```rust
// catalog / file io setup, balbalba
let table = balabala();
let scan = table.scan().select_all().build().unwrap();
dbg!(&scan);
let batch_stream = scan.to_arrow().await.unwrap();
let batches: Vec<_> = batch_stream.try_collect().await.unwrap();
```
> Got it. So, we need to support presign in the REST catalog. Could you help by creating an issue for this? I'll review this section and draft a plan for its implementation.
Issue #504 created
> @ZENOTME and I did integration tests in icelake before, and I have to say that without SQL engine support, it's painful to maintain those tests.

> I agree that we need a SQL engine to make testing easier. However, maintaining basic unit tests based on `fs` or `memory` should be straightforward, right? We don't need separate test modules; just implement them as unit tests in the REST catalog. For example, it could be as simple as...
>
> ```rust
> // catalog / file io setup, balbalba
> let table = balabala();
> let scan = table.scan().select_all().build().unwrap();
> dbg!(&scan);
> let batch_stream = scan.to_arrow().await.unwrap();
> let batches: Vec<_> = batch_stream.try_collect().await.unwrap();
> ```
Correctly writing data into Iceberg is not supported yet, so we need external systems such as Spark to ingest data. Shipping pre-generated Parquet files may be an approach, but that requires maintaining binaries in the repo.
> Correctly writing data into Iceberg is not supported yet, so we need external systems such as Spark to ingest data. Shipping pre-generated Parquet files may be an approach, but that requires maintaining binaries in the repo.
I've got some code in the perf testing branch that might help. It downloads the NYC taxi data and uses MinIO, the REST catalog, and a Spark container to create a table and insert the NYC taxi data into it.
I have fixed the issue where errors were not returned to the user in https://github.com/apache/iceberg-rust/pull/535
I believe this should have been fixed. Please feel free to open new issues if it still exists.
I'm testing using the iceberg rest image from Tabular as a catalog.
Here's the docker-compose.yml file:
I created some data with PyIceberg:
And queried with PyIceberg to verify if it's okay:
It returns 4.
And then with the Rust implementation:
It's returning nothing.
We have to define the S3 configurations because the Tabular image does not return the S3 credentials during the get config process.
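For reference, this is roughly how I build the explicit S3 properties. This is a sketch: the key names follow iceberg-rust's `s3.*` convention, and the values are hypothetical local MinIO defaults; the resulting map would then be handed to the REST catalog configuration rather than relied on from the get config response.

```rust
use std::collections::HashMap;

// Builds explicit S3 properties for FileIO. The key names follow the s3.*
// convention used by iceberg-rust; the values are hypothetical MinIO defaults.
fn s3_props() -> HashMap<String, String> {
    HashMap::from([
        ("s3.endpoint".to_string(), "http://localhost:9000".to_string()),
        ("s3.access-key-id".to_string(), "admin".to_string()),
        ("s3.secret-access-key".to_string(), "password".to_string()),
        ("s3.region".to_string(), "us-east-1".to_string()),
    ])
}

fn main() {
    // These props would be passed to the catalog config so FileIO can
    // authenticate, since the Tabular image does not return credentials.
    let props = s3_props();
    assert_eq!(props.len(), 4);
    println!("{props:?}");
}
```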