apache / iceberg-go

Apache Iceberg - Go
https://iceberg.apache.org/
Apache License 2.0

feat(catalog): add initial rest catalog impl #58

Closed zeroshade closed 8 months ago

zeroshade commented 9 months ago

Adding an initial implementation and unit tests for the REST catalog.

zeroshade commented 9 months ago

CC @Fokko @wolfeidau @nastra @HonahX @jackye1995

nastra commented 9 months ago

I think what's currently missing is having a way to configure the warehouse (which I hardcoded for testing) but also handling the signing part of requests against S3, similar to https://github.com/apache/iceberg-python/blob/f66e3652fdf9720d6c63a6fcec7bcd08d5bb186c/pyiceberg/io/fsspec.py#L70-L95

Listing files via go run ./cmd/iceberg files iceberg124.foobar --catalog rest --uri https://api.dev.tabular.io/ws/ --credential <creds> fails with:

2024/02/13 10:24:07 could not open manifest file: operation error S3: GetObject, https response error StatusCode: 403, RequestID: 066G7WZD23KHZCBJ, HostID: d4V0iCd2uzvp9gZJWDOWmljaREgSaL9Iro0XxOFsv38ECJpdCd/JHWG8Y6/i7oSal8cONZ87Tis=, api error AccessDenied: Access Denied
exit status 1

I believe this is because FileIO isn't configured with the token in the authorization header that comes back in the config inside tblResponse. Reading all other table metadata works via the CLI, but only because those commands never use FileIO; currently only files does.

zeroshade commented 8 months ago

@nastra

Hmm. So, setting the env vars AWS_REGION, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY should all work and get picked up by the FileIO. But I haven't tried testing with https://api.dev.tabular.io/ws/ before.

There is the ability to set a session token via the s3.session-token property but you're right that I don't think it gets propagated. Is there any special configuration I need to set up in order to try testing out the api.dev.tabular.io/ws/ uri myself?
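To make the propagation question concrete, here is a minimal sketch of layering the config returned with a table on top of the catalog properties, then checking which S3 credential keys are still absent. The helper names mergeProps and missingS3Keys are hypothetical and not part of iceberg-go. Only s3.session-token is named in this thread; s3.access-key-id and s3.secret-access-key are assumed by analogy with the PyIceberg property names.

```go
package main

import "fmt"

// mergeProps is a hypothetical helper: it copies the catalog-level properties
// and then applies the per-table config on top, so table values win.
func mergeProps(catalogProps, tableConfig map[string]string) map[string]string {
	merged := make(map[string]string, len(catalogProps)+len(tableConfig))
	for k, v := range catalogProps {
		merged[k] = v
	}
	for k, v := range tableConfig {
		merged[k] = v
	}
	return merged
}

// missingS3Keys reports which of the assumed S3 credential properties are
// absent or empty in props.
func missingS3Keys(props map[string]string) []string {
	var missing []string
	keys := []string{"s3.access-key-id", "s3.secret-access-key", "s3.session-token"}
	for _, k := range keys {
		if props[k] == "" {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Placeholder values for illustration only.
	catalogProps := map[string]string{"uri": "https://example.invalid/ws/"}
	tableConfig := map[string]string{"s3.session-token": "placeholder-token"}

	props := mergeProps(catalogProps, tableConfig)
	fmt.Println("missing S3 keys:", missingS3Keys(props))
}
```

If a check like this reports missing keys right after loading a table, the credentials never arrived from the catalog, which is a different problem from the FileIO dropping them.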

zeroshade commented 8 months ago

@nastra So I've figured out the issue:

The properties are correctly being propagated to the FileIO object; however, it looks like the Tabular API doesn't like the Go Iceberg User-Agent.

I loaded up pyiceberg to see what it does differently, and saw that the response to the table request included a series of S3 properties in the config, including an access-key-id, session-token, and secret-access-key. When I looked at the same request from the Go CLI, those properties weren't there. If I hardcode the User-Agent that the Go CLI sends to PyIceberg/0.5.1, those properties are suddenly returned and loading the manifests works just fine. So the problem is definitely that the Tabular REST catalog doesn't recognize the User-Agent and therefore doesn't send the S3 key properties.

Anything we can do on the tabular side? During RestCatalog.LoadTable
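For anyone who wants to reproduce the User-Agent experiment without patching the CLI, a minimal sketch using only the Go standard library follows. The userAgentTransport type and the seenUserAgent helper are hypothetical names for illustration, not part of iceberg-go, and the local test server merely stands in for a REST catalog.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// userAgentTransport is a hypothetical wrapper that overrides the User-Agent
// header on every outgoing request before delegating to base.
type userAgentTransport struct {
	base      http.RoundTripper
	userAgent string
}

func (t *userAgentTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper must not modify the caller's request, so work on a clone.
	clone := req.Clone(req.Context())
	clone.Header.Set("User-Agent", t.userAgent)
	return t.base.RoundTrip(clone)
}

// seenUserAgent starts a local test server, sends one request through a client
// that uses userAgentTransport, and returns the User-Agent the server observed.
func seenUserAgent(userAgent string) (string, error) {
	seen := make(chan string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen <- r.UserAgent()
	}))
	defer srv.Close()

	client := &http.Client{
		Transport: &userAgentTransport{base: http.DefaultTransport, userAgent: userAgent},
	}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return <-seen, nil
}

func main() {
	ua, err := seenUserAgent("PyIceberg/0.5.1")
	if err != nil {
		panic(err)
	}
	fmt.Println("server saw User-Agent:", ua)
}
```

Wrapping the transport keeps the override in one place, so it can be removed once the catalog recognizes the Go client's own User-Agent.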

zeroshade commented 8 months ago

@nastra Added several issues as suggested