apache / polaris

The interoperable, open source catalog for Apache Iceberg
http://polaris.io/
Apache License 2.0

[FEATURE REQUEST] On-Premise S3 & Remote Signing #32

Open c-thiel opened 1 month ago

c-thiel commented 1 month ago

Is your feature request related to a problem? Please describe.

Currently Polaris only works with AWS S3. It would be great to get support for on-prem deployments as well!

Describe the solution you'd like

Add an additional storage profile, similar to the AWS one, that allows custom endpoint configuration. Test with MinIO or Ceph. Grant access via remote signing.

Describe alternatives you've considered

I don't think there are any?

Additional context

Remote signing spec: https://github.com/apache/iceberg/blob/main/aws/src/main/resources/s3-signer-open-api.yaml

mjf-89 commented 1 month ago

It would be great to know whether such a feature is already on the roadmap. I would personally be interested in contributing here, because I currently work for a company where everything is on-premise. We are using Iceberg with a legacy Hive Standalone Metastore and we are looking for a REST alternative. Polaris is really promising, but the lack of support for on-premise S3 providers will hinder its adoption.

chris922 commented 1 month ago

When I watched the demo on the Snowflake site, this was the first thing I noticed: where do you configure the S3 endpoint, etc.?

I can also support here, maybe in development but also by testing it with some S3 alternatives. I've got access to Dell ECS, NetApp StorageGRID, and MinIO.

guitcastro commented 1 month ago

The main point is that only a few S3-compatible services support the Security Token Service (STS). MinIO does support it.

mjf-89 commented 1 month ago

@guitcastro It might be possible to implement remote signing for on-prem S3 implementations other than MinIO. But that would also mean implementing the remote signing OpenAPI spec, and that is probably outside the scope of Polaris?

guitcastro commented 1 month ago

@mjf-89 I don't know more than you do. I am not a maintainer; my comment is just based on how S3 auth is implemented.

dimas-b commented 1 month ago

@mjf-89 : could you provide more details (perhaps a link) about the remote signing open API spec that you mentioned above?

snazy commented 1 month ago

> The main point is that only a few S3-compatible services support the Security Token Service (STS). MinIO does support it.

STS is needed for credential vending, as currently implemented.

The actual request signing doesn't interact w/ any remote service. The client (in this case Iceberg) asks the resource (Polaris) to return a signed URL for every particular S3 request.
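To make the flow described above concrete, here is a minimal sketch of what the signing side could compute for a single S3 request, using only the Python standard library. It follows the general AWS Signature Version 4 scheme; it is not Polaris code. The function name `sign_s3_request`, the example endpoint, and the choice of `UNSIGNED-PAYLOAD` are assumptions made for illustration. The point is that the secret key never leaves the signer: the client only receives the resulting headers.

```python
import datetime
import hashlib
import hmac
from urllib.parse import parse_qsl, quote, unquote, urlsplit

# Simplification (assumption): the payload hash is not part of the signature.
UNSIGNED = "UNSIGNED-PAYLOAD"


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sign_s3_request(method, uri, region, access_key, secret_key, now):
    """Hypothetical helper: return SigV4 headers for one S3 request."""
    parts = urlsplit(uri)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date = amz_date[:8]
    headers = {
        "host": parts.netloc,
        "x-amz-content-sha256": UNSIGNED,
        "x-amz-date": amz_date,
    }
    names = sorted(headers)
    signed_headers = ";".join(names)
    canonical_headers = "".join(f"{n}:{headers[n]}\n" for n in names)
    # Query parameters are sorted and percent-encoded.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    canonical_query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}" for k, v in pairs
    )
    # Decode then re-encode the path so it is percent-encoded exactly once.
    canonical_uri = quote(unquote(parts.path or "/"), safe="/-_.~")
    canonical_request = "\n".join(
        [method, canonical_uri, canonical_query,
         canonical_headers, signed_headers, UNSIGNED]
    )
    scope = f"{date}/{region}/s3/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> "aws4_request".
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    for part in (region, "s3", "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(
        key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    headers["authorization"] = (
        f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )
    return headers


if __name__ == "__main__":
    now = datetime.datetime.now(datetime.timezone.utc)
    # Placeholder endpoint and credentials, not real values.
    for name, value in sign_s3_request(
        "GET", "http://minio.example:9000/warehouse/db/t/data/f.parquet",
        "us-east-1", "AKIDEXAMPLE", "not-a-real-secret", now,
    ).items():
        print(f"{name}: {value}")
```

Because the host header is taken from the request URI, the same computation works for a custom on-prem endpoint as long as the storage service accepts SigV4.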

mjf-89 commented 1 month ago

@dimas-b sure, in the openapi spec of the rest catalog you can see that there are currently two supported delegated access mechanisms, vended credentials and remote signing:

https://github.com/apache/iceberg/blob/e9364faabcc67eef6c61af2ecdf7bcf9a3fef602/open-api/rest-catalog-open-api.yaml#L1488

And here you can find the openapi spec for the remote signing service:

https://github.com/apache/iceberg/blob/e9364faabcc67eef6c61af2ecdf7bcf9a3fef602/aws/src/main/resources/s3-signer-open-api.yaml
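For orientation, the payloads exchanged with such a signing service can be sketched roughly as follows. The field names (`region`, `method`, `uri`, `headers`) reflect my reading of the linked spec and should be checked against the YAML; the Python class and function names are hypothetical, and the optional fields of the spec are left out.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class S3SignRequest:
    """What the client sends: the S3 request it wants to have signed."""
    region: str
    method: str  # e.g. "GET", "PUT", "HEAD", "DELETE"
    uri: str
    # Header name -> list of values (assumed shape, check the spec).
    headers: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class S3SignResponse:
    """What the signer returns: the URI and headers to actually send."""
    uri: str
    headers: dict[str, list[str]]


def encode_request(req: S3SignRequest) -> str:
    return json.dumps(asdict(req))


def decode_response(payload: str) -> S3SignResponse:
    data = json.loads(payload)
    return S3SignResponse(uri=data["uri"], headers=data["headers"])


if __name__ == "__main__":
    req = S3SignRequest(
        region="us-east-1",
        method="GET",
        uri="http://minio.example:9000/warehouse/db/t/data/f.parquet",
        headers={"Host": ["minio.example:9000"]},
    )
    print(encode_request(req))
```

The client then issues the S3 call itself, using the returned URI and headers, so the data path still goes directly from the engine to the object store.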

I don't know of any open source implementation of that OpenAPI spec; however, I think that Tabular is based on such a thing. Or at least that is what I guessed from reading their blog posts:

https://tabular.io/blog/securing-the-data-lake-part-1/

There, I interpreted the "authorized file access request" as a presigned URL that the remote signing service gives back to the engine to access the data files.

c-thiel commented 1 month ago

I agree that STS is the better solution if available, but not all S3 services support it. It would be nice to add remote signing as a fallback for on-premise deployments. @mjf-89 there are currently two open-source catalogs that support it, Project Nessie and the TIP Iceberg Catalog - links go roughly to the corresponding code sections.

mjf-89 commented 1 month ago

@c-thiel thank you very much. The last time I checked Nessie, the Iceberg REST API was still not implemented and S3 remote signing was definitely not there. TIP was completely off my radar, but it seems really promising, especially for the customization freedom on the authz side. Happy to see that the landscape of Iceberg REST catalogs is evolving so rapidly.

As for Polaris, I hope that remote signing can be implemented as a fallback for those S3 implementations that do not have an STS endpoint, as you said.

dimas-b commented 4 weeks ago

Remote Signing can be a useful feature. I'd support adding it to Polaris.

  1. The catalog does not expose any long- or mid-term credentials to the client (this reduces the risk of credential leaks and makes access revocation immediate, if/when it happens).
  2. Client session runtime is not limited by STS session restrictions (extremely long client sessions are possible at the expense of slightly slower storage I/O calls).
  3. The catalog can (hypothetically) make finer-grained access decisions that are not expressible in terms of STS policies.

dimas-b commented 4 weeks ago

@mjf-89 :

> I don't know of any open source implementation of that OpenAPI spec

Just FYI: Nessie supports that.

mjf-89 commented 3 weeks ago

@dimas-b thank you, as @c-thiel already mentioned both Nessie and TIP actually support that feature.

One question that I have regards the performance implications of remote signing. I feel like it could add quite a bit of latency to queries, although of course much depends on the implementation of both the runtime and the catalog.
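As a rough way to reason about that overhead, the back-of-the-envelope model below assumes one signer round trip per S3 request and a fixed number of requests in flight at once. The function name and the numbers in the usage example are purely illustrative, not measurements; caching or batching in a real client would change the picture.

```python
import math


def added_signing_latency_s(num_requests: int, signer_rtt_ms: float,
                            concurrency: int = 1) -> float:
    """Estimate the extra wall-clock time spent waiting on the signer.

    Assumes (hypothetically) one signing round trip per S3 request and
    `concurrency` requests in flight at a time.
    """
    if num_requests < 0 or signer_rtt_ms < 0 or concurrency < 1:
        raise ValueError("invalid input")
    waves = math.ceil(num_requests / concurrency)
    return waves * signer_rtt_ms / 1000.0


if __name__ == "__main__":
    # Illustrative inputs only: 1000 S3 calls, 20 ms per signing round trip.
    print(added_signing_latency_s(1000, 20, concurrency=1))
    print(added_signing_latency_s(1000, 20, concurrency=16))
```

Under this model the overhead scales with the number of object-store calls divided by the parallelism, so scans over many small files would be affected the most.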

Another thing to note is that not all runtimes actually support such a feature. As an example, I think that Trino currently lacks such support: https://github.com/trinodb/trino/issues/21189