alamb opened 2 weeks ago
BTW my personal plans over the next few months are likely going to be focused on consolidating some of the gains / improvements we have made recently. That includes:
- Improving the project's documentation

Performance-wise I plan to
I am not sure if this is the place for it but I have been putting a lot of work into dft and plan on doing a release before end of year.
> I am not sure if this is the place for it but I have been putting a lot of work into dft and plan on doing a release before end of year.

For anyone else following along, dft is https://github.com/datafusion-contrib/datafusion-dft
I may want to help with delta / iceberg integration; I think they are quite important. But I will work on performance tasks first.
@jayzhan211 I agree, they are very important. Unfortunately, we have been held up because the crates use different versions of DataFusion. The idea was to converge on 42 - which iceberg and hudi currently use - but deltalake (which we already have an integration for) is on 41 and hasn't been able to upgrade yet. It looks like they are skipping version 42 now and will use 43, so hopefully this is resolved soon.
Here is some relevant work
@jayzhan211 and to be more explicit on my release plans, I did not plan on releasing until iceberg and hudi were added.
LogicalType is important too https://github.com/apache/datafusion/issues/12622
I think another potentially very interesting approach here will be to use the FFI bindings from @timsaucer. The idea there would be to wrap delta / iceberg in a stable ABI (aka the FFI bindings) so we could call delta.rs / iceberg code that uses a different version of DataFusion than dft does.
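To make the stable-ABI idea concrete: instead of passing Rust trait objects (whose layout can change between compiler and crate versions) across a library boundary, each side agrees on `#[repr(C)]` structs of `extern "C"` function pointers, whose layout is fixed. The following is only a stdlib-only sketch of that general pattern; the types and functions here are hypothetical and are not the real datafusion-ffi API:

```rust
// Minimal illustration of the stable-ABI pattern behind FFI bindings.
// The "provider" side exposes a #[repr(C)] vtable of extern "C" function
// pointers, so the layout is fixed regardless of which Rust / DataFusion
// version compiled each side. (Hypothetical types, not datafusion-ffi.)

#[repr(C)]
pub struct FfiTableProvider {
    /// Opaque pointer to the provider's own state.
    state: *mut std::ffi::c_void,
    /// Returns the number of columns in the table's schema.
    column_count: extern "C" fn(state: *mut std::ffi::c_void) -> u32,
    /// Frees the provider's state.
    release: extern "C" fn(state: *mut std::ffi::c_void),
}

// "Provider" side: a concrete table type that stays behind the boundary.
struct MyTable {
    columns: Vec<String>,
}

extern "C" fn my_column_count(state: *mut std::ffi::c_void) -> u32 {
    let table = unsafe { &*(state as *const MyTable) };
    table.columns.len() as u32
}

extern "C" fn my_release(state: *mut std::ffi::c_void) {
    drop(unsafe { Box::from_raw(state as *mut MyTable) });
}

/// Wraps a concrete table in the stable ABI struct.
fn export_table(table: MyTable) -> FfiTableProvider {
    FfiTableProvider {
        state: Box::into_raw(Box::new(table)) as *mut std::ffi::c_void,
        column_count: my_column_count,
        release: my_release,
    }
}

fn main() {
    // "Consumer" side: only sees the C-compatible struct, never MyTable.
    let provider = export_table(MyTable {
        columns: vec!["id".into(), "value".into()],
    });
    let n = (provider.column_count)(provider.state);
    println!("columns: {n}"); // prints "columns: 2"
    (provider.release)(provider.state);
}
```

Because only C-compatible layouts cross the boundary, the consumer and provider can be compiled against different DataFusion versions, which is exactly what the version-mismatch problem above needs.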
On the Python side, getting better integration with the Python delta-rs package was the entire reason for pushing for the FFI bindings. I have branches ready to go for datafusion-python and delta-rs as soon as 43.0.0 releases. I also have tested it with a few of the other table providers in datafusion-contrib.
For the pure Rust implementations, I think it would be best to not cross the unsafe FFI boundary if you don't have to. Unfortunately, that does put additional dependencies on the other crates updating at a reasonable pace.
I think as soon as DataFusion 43.0.0 is released we'll be able to test it out (upgrade dft to DataFusion 43, then connect dft to delta-rs with its older DataFusion version). It should be quite sweet.
I don't know if this discussion is the place we want to track work in the other related projects, but my top goals for 2024 Q4 are:
- Getting the datafusion-python and delta-rs integration over the line using FFI https://github.com/apache/datafusion-python/pull/921
- datafusion-python https://github.com/apache/datafusion-python/issues/842
- datafusion-python or datafusion https://github.com/apache/datafusion-python/issues/936 - this might be Q1 2025
- Adding QueryPlanner to datafusion-python and implementing a custom query planner in datafusion-ray, so that we can run all DataFusion commands that perform execute() on the dataframe and it just knows how to run them distributed
- SessionContext via FFI

One thing that's not clear to me with the FFI approach is who the intended owner of the bindings is. Should it be dft, as I don't want to worry about my deps being on different DataFusion versions, or is it more for the table provider crates (iceberg, deltalake, hudi, etc.)? Of course in the short term it could be prototyped in dft and contributed back to those repos, but I'm asking more about where the appropriate home would be in the target state.
The version of DataFusion used in the bindings has to match the client program (dft in this case), so I don't think they can go in the delta/iceberg crates.
One thing we might be able to do is have a separate crate like datafusion-delta-table-provider that has different feature flags for different DataFusion versions 🤔 -- but I am now more or less wildly speculating
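For what that speculation might look like in practice: Cargo allows a crate to depend on two semver-incompatible versions of the same package by renaming them, so a shim crate could gate each rename behind a feature. This is a purely hypothetical sketch (crate name, feature names, and versions are all made up for illustration):

```toml
# Hypothetical Cargo.toml for a "datafusion-delta-table-provider" shim crate.
# Each feature pulls in a differently-renamed DataFusion major version; the
# crate's source would use cfg(feature = "...") to pick the matching code path.
[package]
name = "datafusion-delta-table-provider"
version = "0.1.0"
edition = "2021"

[features]
default = ["df43"]
df42 = ["dep:datafusion_v42"]
df43 = ["dep:datafusion_v43"]

[dependencies]
datafusion_v42 = { package = "datafusion", version = "42", optional = true }
datafusion_v43 = { package = "datafusion", version = "43", optional = true }
```

The maintenance cost of this approach is that every public type touching DataFusion would need a cfg-gated variant per supported version, which may be why the FFI route is attractive instead.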
@matthewmturner Do you have anything in mind moving forward for integrating the rest of the data lakes? (such as a list of what needs to be done moving forward)
Yes, now that DataFusion v43 has been released I am hoping that the Rust implementations of the three main data lake formats (Deltalake / Iceberg / Hudi) update to that version. Then I will:
- TableProviderFactory
I am interested in the FFI bindings but I don't anticipate working on that prior to the current release I am planning.
I have had a few days to reflect, and I personally think making it easy to integrate DataFusion into the "open data lake" stack might be my top priority over the coming months.

@julienledem wrote up a very nice piece describing this: The advent of the Open Data Lake

In my mind, the specific work this entails includes things like
More to come
@alamb can you expand on the different runtimes point? Are you referring to having a dedicated tokio runtime for CPU-bound work? I actually have a ticket open for that - if that is a very important item for you I can add it to the list to do before releasing.
This is what I had in mind:
Thanks for the link to the dft one. That is a good one.
@alamb I will work on that next. Will ping you when ready for review.
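For readers following the "different runtimes" exchange above: the idea is to keep CPU-heavy query work off the threads that service I/O, so network and timer tasks are never starved. A real implementation would use a second dedicated tokio Runtime (per the linked ticket); the following is only a stdlib-only sketch of the shape of that separation, with a plain worker thread standing in for the CPU-bound pool:

```rust
// Stdlib-only sketch of the "separate pool for CPU-bound work" idea.
// In the actual proposal the second pool would be a dedicated tokio
// Runtime; here a plain thread stands in for it.
use std::sync::mpsc;
use std::thread;

/// Runs a CPU-heavy job on a dedicated worker thread and hands the
/// result back over a channel, so the calling ("I/O") thread never
/// blocks inside the computation itself.
fn spawn_cpu_work<T, F>(job: F) -> mpsc::Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the caller may have dropped the receiver.
        let _ = tx.send(job());
    });
    rx
}

fn main() {
    // "I/O" side: submit an expensive computation and keep going.
    let rx = spawn_cpu_work(|| (1u64..=1_000).sum::<u64>());
    // ... the I/O thread would continue servicing other tasks here ...
    let total = rx.recv().unwrap();
    println!("sum = {total}"); // prints "sum = 500500"
}
```

With two tokio runtimes the hand-off is the same shape, except the channel endpoints are async and the CPU runtime is sized to the machine's core count.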
More to come
I filed
to try and organize my thoughts here better
Is your feature request related to a problem or challenge?
The last roadmap discussion we had seems to have worked out well to galvanize and get us organized around some common goals
Describe the solution you'd like
Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!
Describe alternatives you've considered
No response
Additional context
No response