delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.2k stars 395 forks source link

feat(rust): make PartitionWriter public #2525

Closed adriangb closed 4 months ago

adriangb commented 4 months ago

I would like to use PartitionWriter within my own crate. Would you mind making this public so I don't have to vendor it and all of the associated code?

github-actions[bot] commented 4 months ago

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

rtyler commented 4 months ago

@adriangb I'm not opposed to the idea, can you share more about how this can be used outside of the delta context, I'm curious

adriangb commented 4 months ago

This isn't to use outside of the delta context necessarily, just hook in more low level and with less code churn.

I have a Rust service that does buffering of data before writing to Delta. I control all of the writes going to Delta so I want to avoid all of the overhead of checking that the schemas match, etc which happens if you go through the public APIs. I also want to take control of the write retry loop because I may have to switch to pessimistic concurrency control under high write load (take out a lock on a low latency system like Redis only at the stage of writing the commit) to avoid write contention.

Currently data is written out manually using ArrowWriter as hive partitioned parquet. So if I can hook into PartitionWriter it's relatively little code changes (actually saves me a good bit of code doing the partitioning) and I can then go implement the writing of the commits as a later step to convert to a delta table.

rtyler commented 4 months ago

Sounds reasonable to me! I think this API is stable enough to make public in the next release, please run a cargo fmt and then this should merge

adriangb commented 4 months ago

done @rtyler ! Seems like it dismissed your review :(

roeap commented 4 months ago

@rtyler @adriangb - recently I have been thinking if we should adopt the visibility crate to expose a larger api behind a feature gate and less guarantees. We uses this - so gar successfully I believe- in kernel ...

There also is some writing stuff - e.g. muti-part put from object_store on my todo list, that may affect this.

adriangb commented 4 months ago

@roeap not sure I understood fully, are you saying that this right now should not be made public or that there is an opportunity for a more thoughtful approach to a public API within the rust crates in the future?

ion-elgreco commented 4 months ago

@roeap best to wait for object store 0.10 before working on the multi part

@aersam also already put some work there

roeap commented 4 months ago

@adriangb - i think we can make this public now, but maybe behind a feature gate. The visibility crate enables this.

@ion-elgreco - yes, still some things to cross off first, and we'll work from where we are, when i get to it 🙂.

adriangb commented 4 months ago

i think we can make this public now, but maybe behind a feature gate. The visibility crate enables this.

Interesting, TIL!