Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Add support to translate `object_store` storage options to `daft.io.IOConfig` #2435

Open kevinzwang opened 4 months ago

kevinzwang commented 4 months ago

Several other libraries pass around a storage options dictionary, which the object_store Rust crate then uses to authenticate and to perform reads and writes. To make it easier for users to move to Daft, we could provide a way for them to reuse their storage options in Daft.

There are two ways to do this:

  1. Create a function like storage_options_to_io_config(options: dict[str, str]) -> IOConfig which does this conversion. One open question is how to determine which cloud provider the user is targeting, since storage options for different cloud providers are not disjoint.
  2. Allow users to pass in storage_options wherever they can pass io_config. In this case we can usually infer the cloud provider, so it would probably be a cleaner API, but it would make it harder for users to take advantage of authentication flows that we have but object_store doesn't.

Another thing to consider is whether we want to use the mappings in the object_store crate, which would require dipping into the Rust layer, or copy the mappings into our own code.
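For illustration, option 1 with the mappings copied into Python might look roughly like the sketch below. It covers S3 only and sidesteps provider detection by having the caller pick the S3 helper explicitly. The key table is a partial, hand-copied subset of object_store-style keys, the target names (key_id, access_key, session_token, region_name, endpoint_url) are assumed to match the keyword arguments of daft.io.S3Config, and storage_options_to_s3_kwargs is a hypothetical name, not an existing Daft function.

```python
from typing import Dict, Tuple

# Partial, illustrative mapping from object_store-style S3 option keys to the
# keyword names ASSUMED for daft.io.S3Config. This is not the full alias table.
_S3_KEY_MAP: Dict[str, str] = {
    "aws_access_key_id": "key_id",
    "access_key_id": "key_id",
    "aws_secret_access_key": "access_key",
    "secret_access_key": "access_key",
    "aws_session_token": "session_token",
    "session_token": "session_token",
    "aws_region": "region_name",
    "region": "region_name",
    "aws_endpoint": "endpoint_url",
    "endpoint": "endpoint_url",
}


def storage_options_to_s3_kwargs(
    options: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Hypothetical helper: split storage options into mapped kwargs and leftovers.

    Returns (kwargs, unmapped). Keys are matched case-insensitively; anything
    not in the table (e.g. delta-rs specific options) is returned untouched so
    the caller can decide whether to warn or raise.
    """
    kwargs: Dict[str, str] = {}
    unmapped: Dict[str, str] = {}
    for key, value in options.items():
        target = _S3_KEY_MAP.get(key.lower())
        if target is None:
            unmapped[key] = value
        else:
            kwargs[target] = value
    return kwargs, unmapped


kwargs, unmapped = storage_options_to_s3_kwargs(
    {"AWS_REGION": "us-west-2", "aws_access_key_id": "AKIA_EXAMPLE", "allow_unsafe_rename": "true"}
)
# kwargs could then be passed on, e.g. IOConfig(s3=S3Config(**kwargs)),
# assuming those keyword names are accepted.
```

Returning the unmapped keys separately keeps the helper honest about options it does not understand, which matters because some keys in these dictionaries belong to other libraries rather than to object_store.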

djouallah commented 3 months ago

I want to pass this option, but I don't know how to do it:

storage_options={"allow_unsafe_rename":"true"}

samster25 commented 3 months ago

@djouallah Looks like allow_unsafe_rename is an option that is used by delta-rs rather than object store. A workaround should be to set

export AWS_S3_ALLOW_UNSAFE_RENAME=true

source: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/

djouallah commented 2 months ago

Yes, but how do I do it in Daft? That was my question.

jaychia commented 2 months ago

This isn't a Daft-specific configuration! It comes from delta-rs, and it isn't an object_store configuration either. You can just set the environment variable for your program like so, which will correctly configure delta-rs.

export AWS_S3_ALLOW_UNSAFE_RENAME=true

djouallah commented 2 months ago

no luck in a notebook :(

OSError: Generic LocalFileSystem error: Unable to copy file from /synfs/lakehouse/default/Tables/T10/daft/_delta_log/_commit_c475e751-6256-4777-8fa7-fc8f1704d785.json.tmp to /synfs/lakehouse/default/Tables/T10/daft/_delta_log/00000000000000000000.json: Function not implemented (os error 38)

samster25 commented 2 months ago

@jaychia @kevinzwang Let's expose an option for allow_unsafe_rename. I dug through the delta-rs code and it looks like they overload allow_unsafe_rename to do both AWS_S3_ALLOW_UNSAFE_RENAME for S3 and an allow path for other filesystems.

https://github.com/delta-io/delta-rs/blob/f05b2bf31530def92cdf7c5f22812e3ed6fe4eec/crates/aws/src/storage.rs#L419C17-L419C36

samster25 commented 2 months ago

@jaychia I think this is the code path that is getting hit when allow_unsafe_rename is set and the object store is mounted locally.

https://github.com/delta-io/delta-rs/blob/f05b2bf31530def92cdf7c5f22812e3ed6fe4eec/crates/mount/src/lib.rs#L46

samster25 commented 2 months ago

LOL it seems like they reused the key allow_unsafe_rename for both s3 and mount filesystem

https://github.com/delta-io/delta-rs/blob/f05b2bf31530def92cdf7c5f22812e3ed6fe4eec/crates/mount/src/config.rs#L29

kevinzwang commented 2 months ago

Yeah, we can definitely add this. First, @djouallah, could you check whether setting export MOUNT_ALLOW_UNSAFE_RENAME=true fixes the error you saw?
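In a notebook there is no shell in which to run export, so one way to try this is to set the variables from Python before the write runs. This is a minimal sketch: the two variable names are the ones mentioned in this thread, allow_unsafe_rename is a hypothetical helper name, and it assumes delta-rs reads the environment of the same process at write time.

```python
import os


def allow_unsafe_rename(enable: bool = True) -> None:
    """Hypothetical helper: set the delta-rs variables named in this thread."""
    value = "true" if enable else "false"
    # AWS_S3_ALLOW_UNSAFE_RENAME applies to S3; MOUNT_ALLOW_UNSAFE_RENAME is the
    # one suggested above for locally mounted storage.
    for name in ("AWS_S3_ALLOW_UNSAFE_RENAME", "MOUNT_ALLOW_UNSAFE_RENAME"):
        os.environ[name] = value


# Call this in a cell that runs before the Delta Lake write.
allow_unsafe_rename()
```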

djouallah commented 2 months ago

It is working, and it is freaking fast! Interesting.

Question: how do I partition by a column, and is there a way to control the file size? It seems Daft generates really small files (15 MB).

Edit: it works fine in delta_rs 0.17.4 but not 0.18.2.

kevinzwang commented 2 months ago

@djouallah we do not yet have the ability to do partitioned writes, but we are working on it! As for file sizes, maybe we can expose a config parameter for that, I'll take a look.

kevinzwang commented 2 months ago

> edit : it works fine in delta_rs 0.17.4 but not 0.18.2

Do you see a specific error with 0.18.2, or does it just have the same behavior as when MOUNT_ALLOW_UNSAFE_RENAME is not set?