apache / hudi-rs

A native Rust library for Apache Hudi, with bindings into Python
https://hudi.apache.org/
Apache License 2.0

feat: support storage_options param when reading from table #139

Closed kazdy closed 1 month ago

kazdy commented 2 months ago

To integrate hudi-rs with AWS SDK for pandas (awswrangler), we must be able to pass boto3 session-related AWS authentication params (mostly AWS_* params) directly, rather than relying only on environment variable inference.

I want to propose adding an option to handle this:

storage_options = {"AWS_ACCESS_KEY_ID": "xxxx", "AWS_SECRET_ACCESS_KEY": "xxxx", "AWS_SESSION_TOKEN": "xxxx"}
hudi_table = HudiTable("/tmp/trips_table", storage_options=storage_options)
records = hudi_table.read_snapshot()

Although I want to add this for S3, it should work for other storage backends. I'm happy to contribute and add this.
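A minimal sketch of how a caller might build such a storage_options dict from session credentials and let it take precedence over environment variables. Both helpers (`storage_options_from_credentials`, `resolve_storage_options`) are hypothetical and not part of hudi-rs; the `FrozenCredentials` tuple is a stand-in with the same fields as botocore's frozen credentials, and `AWS_SESSION_TOKEN` is assumed as the token key because it is the standard AWS environment variable name.

```python
import os
from collections import namedtuple

# Stand-in with the same fields as botocore's ReadOnlyCredentials
# (access_key, secret_key, token), so the sketch runs without boto3.
FrozenCredentials = namedtuple(
    "FrozenCredentials", ["access_key", "secret_key", "token"]
)


def storage_options_from_credentials(creds):
    """Hypothetical helper: map frozen credentials to AWS_* storage options."""
    options = {
        "AWS_ACCESS_KEY_ID": creds.access_key,
        "AWS_SECRET_ACCESS_KEY": creds.secret_key,
    }
    if creds.token:  # only temporary credentials carry a session token
        options["AWS_SESSION_TOKEN"] = creds.token
    return options


def resolve_storage_options(storage_options=None, env=None):
    """Hypothetical helper: explicit options override AWS_* env variables."""
    env = os.environ if env is None else env
    # Start from whatever the environment provides...
    resolved = {k: v for k, v in env.items() if k.startswith("AWS_")}
    # ...then let explicitly passed options win.
    resolved.update(storage_options or {})
    return resolved
```

With boto3 installed, the credentials object could come from `boto3.Session().get_credentials().get_frozen_credentials()`, and the resulting dict would be what gets passed as `storage_options` in the proposal above.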

xushiyan commented 2 months ago

@kazdy sounds good. feel free to take this up and send a pr

kazdy commented 1 month ago

I'll wait until #72 gets merged. I did a first strawman impl, and it requires some refactoring in the Table itself.

@xushiyan I also have some questions about this, maybe you can give me your opinion on these:

  1. Should we rename Table to HudiTable?
  2. I don't know why Timeline and FileSystemView each use a separate storage instance. Can't they share one? Maybe there's a reason it's done this way that I can't see atm.
  3. Does it make sense to introduce something that will hold both Timeline and FileSystemView (basically table state) and expose a coherent API?

thanks

xushiyan commented 1 month ago

hey @kazdy

1) We keep the name Table within hudi-core to avoid a redundant prefix; everything in hudi-core is about Hudi. When importing it into other crates, we can give it an alias like HudiTable. We can also add an alias in the hudi crate for the external-facing API when needed. As of now, there's no strong need for this.

2) Timeline is responsible for the data stored in timeline files under .hoodie/, and FileSystemView is responsible for the data stored under the table excluding .hoodie/. It's good to keep things less coupled unless there is a need for sharing; it's a stateless client performing IO anyway. Maybe you can make a case for why it should be shared?

3) Currently Table holds Timeline and FileSystemView. Could you elaborate on what you mean by a coherent API?
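For readers following the thread, the layout under discussion can be sketched roughly as below. This is a simplified Python model with hypothetical class and parameter names (`Storage`, `share_storage`), not the actual hudi-core Rust code. It only illustrates that, whether the two components get separate storage clients or share one, the storage options have to be threaded through Table to both.

```python
from dataclasses import dataclass, field


@dataclass
class Storage:
    """Hypothetical stateless IO client configured with storage options."""
    base_uri: str
    options: dict = field(default_factory=dict)


class Timeline:
    """Reads timeline files under .hoodie/ (modelled here as a stub)."""
    def __init__(self, storage):
        self.storage = storage


class FileSystemView:
    """Reads data files under the table, excluding .hoodie/ (stub)."""
    def __init__(self, storage):
        self.storage = storage


class Table:
    """Holds both components and forwards storage options to each."""
    def __init__(self, base_uri, storage_options=None, share_storage=False):
        options = dict(storage_options or {})
        timeline_storage = Storage(base_uri, options)
        # Either reuse the same client or build an independent one
        # from a copy of the same options.
        view_storage = (
            timeline_storage if share_storage
            else Storage(base_uri, dict(options))
        )
        self.timeline = Timeline(timeline_storage)
        self.file_system_view = FileSystemView(view_storage)
```

In both variants the components see identical options; the only difference is whether they hold the same client object.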