delta-incubator / delta-sharing-rs

A Minimalistic Rust Implementation of Delta Sharing Server.
MIT License
77 stars 8 forks source link

Add support for generating Azure Blob Storage signed URLs #22

Closed tdikland closed 6 months ago

tdikland commented 6 months ago

This PR introduces support for sharing Delta Lake tables stored on Azure Blob Storage.

I've marked the PR as WIP, since I feel there are still a couple of open questions that we may want to address in this or subsequent issues.

  1. Is it okay to change the signature of the function signing the urls to be async? The crate that I used to call the Azure SDK for building a pre-signed URL has an async method to do so. I figured this was okay, since all route handlers are async anyway.
  2. What should we do if there are no credentials available to sign the urls? Right now the urls are passed through to the client unsigned. We may want to present an error instead.
  3. In Azure there are multiple possibilities for authentication to Azure Blob Storage. I've chosen to implement only the Shared Key method, since that brings this crate at parity with the scala implementation

On a higher level:

  1. I think it is a good idea to bring in a crate that can handle signing URLs for the different cloud object stores where the Delta Table lives. To this end I've published a crate that I used in some other personal projects to do exactly this https://crates.io/crates/cloud-file-signer. This crate defines a CloudFileSigner trait and implements it for a couple of popular object stores (S3, Azure Blob Storage, GCS) and could be extended to other stores (R2, MinIO, etc.)

Looking forward to some feedback!

Closes #18

ognis1205 commented 6 months ago

Hi Tim, thank you for submitting this PR!

I haven't delved into the details of the PR yet, but it looks great. That aside, I like your idea of the Delta Sharing Server SDK, the delta-sharing-server-rs project. Since the Delta Sharing protocol is defined through REST APIs, there may be some considerations to keep in mind; nevertheless, the idea is interesting.

As for the cloud-file-signer crate, I believe it would be more beneficial to expand it into a larger project, similar to the delta-sharing-server-rs. If your contribution here serves any purpose, please feel free to incorporate it.

ognis1205 commented 6 months ago

@tdikland

Regarding the open questions, here are my answers:

  1. Yes, it is totally fine, and it would be better to do so.
  2. According to the official protocol specification, there is no explicit description. However, I think we should return a 500 error and trace the error.
  3. For now, I would not like to introduce complexity, so I really appreciate your implementation that aligns with the reference server implementation. When we want to add further options, we should create another issue.
tdikland commented 6 months ago

Thanks for your quick reaction and enthusiasm @ognis1205.

Regarding the delta-sharing-server-rs project, I think it would be great for users to have both the lib that implements the delta sharing protocol and the batteries included version (which this crate is currently focusing on). The user journey that I'm thinking about is a user starts by deploying the reference implementation of the delta sharing server and when customisation is needed they can take the lib and plugin custom auth, retry, other middleware, different database etc. It may be nice to support this user journey with this project. I will open a new issue to discuss this in more depth.

Regarding the support for the underlying object stores (i.e. AWS, Azure, GCP). Thanks for your insights on my questions. I will be iterating on the implementation and let you know when it's ready for review.

ognis1205 commented 6 months ago

I am very grateful for your contribution @tdikland !

This is a quick review of the current commit. Regarding the 'cloud-file-signer' crate, I believe there is operational experience in your project, but I think it still needs a bit more incubation. Therefore, I believe it's still early to use the crate. Since this project itself is also in incubation, considering the future development speed, I would prefer to have the implementation of signed URLs for each cloud provider done here as much as possible. The experience gained here can then be reflected in 'cloud-file-signer,' and eventually, I would like to use the crate in 'delta-sharing-rs' as well.

Please do not remove utilities::signed_url (or rename it utilities::cloud_signer). Instead, can we consider implementing factory methods to instantiate the cloud signer instance from the credentials of each provider (similar to what is done in the bootstrap module)? I would like a factory method to be added to the struct Utility in utilities::signed_url (or utilities::cloud_signer) module. The reason is that having information in the form of logging, indicating which provider's credentials were loaded during server bootstrap and the design and the implementation would be more consistent with other modules. What do you think?

ognis1205 commented 6 months ago

@tdikland

Regarding the 'cloud-file-signer' crate, even if, for instance, we are not currently using the crate, I am considering stating in the README that we anticipate potential future use of it as a related project.

tdikland commented 6 months ago

Thanks for the feedback. I agree it is not yet desirable to depend on the cloud-file-signer crate. I was experimenting with it and accidentally pushed it to this PR (sorry!). I'll give a heads-up once it is ready for review!

ognis1205 commented 6 months ago

You don't have to apologize for it at all! I apologize if I have caused any disruption to the development. We should have prepared a more accessible system, such as a Code of Conduct (COC), to make participation easier. It was very helpful that you made various suggestions and asked questions.

tdikland commented 6 months ago

Allright, I think I got it down now.

ognis1205 commented 6 months ago

Thank you for your cooperation @tdikland :)

I believe we have reached a consensus on the general direction of the implementation. It seems there are still some parts that I would like to discuss though, before proceeding to the review, let's go ahead and run some tests for now. @roeap could you please approve the workflow? (I deliberately commented out the #[tokio::test] attribute in some unit tests, but I might have added comments to explain. It's my mistake. Let's discuss it later as well)

When you are ready to review, please remove [WIP] from the title. Or do you prefer to conduct detailed reviews gradually?

tdikland commented 6 months ago

Hi @ognis1205, @roeap, I'm ready for reviewing.

tustvold commented 6 months ago

acquire pre-signed URls

We already support this for S3. Azure and GCP are more complex but would be happy for contributions in this space - https://docs.rs/object_store/latest/object_store/signer/trait.Signer.html

ognis1205 commented 6 months ago

@tdikland @roeap I will review this PR in the next 12 hours, here in Tokyo, it's around 24:00 midnight.

I did not see the discussion around delta-sharing-server-rs, but in the past @ognis1205 and I had discussed that we want to break this up into smaller crates. and split the different "sub-apis" such that one can easily mix-and-match a server that supports the sharing protocol with just the desired features.

Is there any place to discuss this issue? I have some opinions on this since the protocol is defined over REST APIs. I think this project might be helpful:

https://github.com/HeroicKatora/oxide-auth

Depending on the conclusion of our discussion, in some cases, I think it would be acceptable to change the name of this project to something like "delta-sharing-sdk" and have @tdikland and @roeap take the lead in managing the project. This would allow for more efficient development, I believe.

tustvold commented 6 months ago

FWIW I've created two upstream tickets on object_store for adding support for Azure and GCP, along with links to the relevant docs. I think they should be relatively straightforward if anyone wanted to pick them up:

roeap commented 6 months ago

@tustvold - i know the azure apis quite well, and signing using keys should (hopefully) be straignforward to leverage the existing signing using the master key ... user delegated are just API cal li think :)

had a quick glance at GCP - might do that second - seems similar, cannonicalize request and sign ...

roeap commented 6 months ago

@tdikland - just FYI, started to add url signing for azure to object store https://github.com/apache/arrow-rs/pull/5259.

tdikland commented 6 months ago

@roeap @ognis1205 - FYI I have created two new issues to keep track of the discussion and next steps.