[DISCUSSION] 2024 Q4 / 2025 Q1 Roadmap

alamb commented 2 weeks ago

Is your feature request related to a problem or challenge?

The last roadmap discussion we had seems to have worked well to galvanize us and get us organized around some common goals:

https://github.com/apache/datafusion/issues/11442

Describe the solution you'd like

Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 2 weeks ago

BTW my personal plans over the next few months are likely going to focus on consolidating some of the gains / improvements we have made recently. That includes:

External communication / documentation
with @SamSynnada like https://github.com/apache/datafusion/discussions/13049
https://github.com/apache/datafusion/issues/11631

Improve the project's documentation

Performance-wise, I plan to:

help work on more advanced parquet predicate pushdown with @XiangpengHao https://github.com/apache/datafusion/issues/3463
Continue to flesh out the grouping code: https://github.com/apache/datafusion/issues/12680

matthewmturner commented 2 weeks ago

I am not sure if this is the place for it but I have been putting a lot of work into dft and plan on doing a release before end of year.

alamb commented 2 weeks ago

I am not sure if this is the place for it but I have been putting a lot of work into dft and plan on doing a release before end of year.

For anyone else following along, dft is https://github.com/datafusion-contrib/datafusion-dft

jayzhan211 commented 2 weeks ago

I am not sure if this is the place for it but I have been putting a lot of work into dft and plan on doing a release before end of year.

I may want to help with delta / iceberg integration; I think they are quite important. But I will work on performance tasks first.

matthewmturner commented 2 weeks ago

I may want to help with delta / iceberg integration; I think they are quite important. But I will work on performance tasks first.

@jayzhan211 I agree, they are very important. Unfortunately, we have been held up because of the crates using different versions of datafusion. The idea was to converge on 42, which iceberg and hudi currently use, but deltalake (which we already have an integration for) is on 41 and hasn't been able to upgrade yet. It looks like they are skipping version 42 now and will use 43, so hopefully this is resolved soon.

Here is some relevant work

matthewmturner commented 2 weeks ago

@jayzhan211 and to be more explicit on my release plans, I did not plan on releasing until iceberg and hudi were added.

jayzhan211 commented 2 weeks ago

LogicalType is important too https://github.com/apache/datafusion/issues/12622

alamb commented 2 weeks ago

I may want to help with delta / iceberg integration; I think they are quite important. But I will work on performance tasks first.

@jayzhan211 I agree, they are very important. Unfortunately, we have been held up because of the crates using different versions of datafusion.

I think another potentially very interesting approach here will be to use the FFI bindings from @timsaucer:

The idea there would be to wrap delta / iceberg in a stable ABI (aka the FFI bindings) so that dft could call delta-rs / iceberg even when they use a different version of DataFusion than dft does.
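
To illustrate what "stable ABI" means here, a minimal sketch follows. It is not the actual FFI bindings API: `FfiTable`, `DeltaLikeTable`, and `export_table` are hypothetical names, and a real binding would exchange schemas and Arrow data rather than a row count. The point is only that the two sides share a `#[repr(C)]` struct of function pointers instead of Rust types tied to one DataFusion version.

```rust
use std::ffi::c_void;

/// Hypothetical C-compatible handle to a table (not a real DataFusion type).
/// Its layout is fixed by `#[repr(C)]`, so it does not depend on which
/// library version either side was compiled against.
#[repr(C)]
pub struct FfiTable {
    /// Opaque provider-side state; the consumer never looks inside it.
    private_data: *mut c_void,
    /// Callback that reports the number of rows.
    num_rows: unsafe extern "C" fn(private_data: *const c_void) -> u64,
    /// Callback that frees the provider-side state.
    release: unsafe extern "C" fn(private_data: *mut c_void),
}

// ---- Provider side (stands in for delta-rs / iceberg in this sketch) ----

struct DeltaLikeTable {
    rows: Vec<i64>,
}

unsafe extern "C" fn num_rows_impl(private_data: *const c_void) -> u64 {
    // Safety: `private_data` was created from a `Box<DeltaLikeTable>` below.
    let table = unsafe { &*(private_data as *const DeltaLikeTable) };
    table.rows.len() as u64
}

unsafe extern "C" fn release_impl(private_data: *mut c_void) {
    // Safety: reclaims the box leaked in `export_table`; called exactly once.
    drop(unsafe { Box::from_raw(private_data as *mut DeltaLikeTable) });
}

/// Wrap provider state in the C-compatible handle.
pub fn export_table(rows: Vec<i64>) -> FfiTable {
    let state = Box::new(DeltaLikeTable { rows });
    FfiTable {
        private_data: Box::into_raw(state) as *mut c_void,
        num_rows: num_rows_impl,
        release: release_impl,
    }
}

// ---- Consumer side (stands in for dft in this sketch) ----

impl FfiTable {
    /// Safe wrapper that only goes through the function pointers.
    pub fn row_count(&self) -> u64 {
        unsafe { (self.num_rows)(self.private_data) }
    }
}

impl Drop for FfiTable {
    fn drop(&mut self) {
        // Hand the state back to the provider's own deallocation routine.
        unsafe { (self.release)(self.private_data) }
    }
}
```

In this sketch the consumer would call `export_table(vec![1, 2, 3]).row_count()` and never touch `DeltaLikeTable` directly, which is what lets the two sides be compiled against different versions.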

timsaucer commented 2 weeks ago

On the python side, getting better integration with the python delta-rs package was the entire reason for pushing for the FFI bindings. I have branches ready to go for datafusion-python and delta-rs as soon as 43.0.0 releases. I also have tested it with a few of the other table providers in datafusion-contrib.

For the pure rust implementations, I think it would be best to not cross the unsafe FFI boundary if you don't have to. Unfortunately that does put additional dependencies on the other crates updating at a reasonable pace.

alamb commented 2 weeks ago

I think as soon as DataFusion 43.0.0 is released we'll be able to test it out:

Update dft to DataFusion 43
Implement a crate binding in dft to delta-rs (which uses an older DataFusion version)

It should be quite sweet

timsaucer commented 2 weeks ago

I don't know if this discussion is the place we want to track work in the other related projects, but my top goals for 2024 Q4 are:

Get the datafusion-python and delta-rs integration over the line using FFI https://github.com/apache/datafusion-python/pull/921
Adding a user tutorial for datafusion-python https://github.com/apache/datafusion-python/issues/842
Evaluating cuDF integration either in datafusion-python or datafusion https://github.com/apache/datafusion-python/issues/936 - this might be Q1 2025
Adding QueryPlanner to datafusion-python and implementing a custom query planner in datafusion-ray so that we can run all datafusion commands that perform execute() on the dataframe and it just knows how to run them distributed
Probably looking to expose more of the SessionContext via FFI

matthewmturner commented 2 weeks ago

One thing that's not clear to me with the FFI approach is who the intended owner of the bindings is: should it be dft, since I don't want to worry about my deps being on different datafusion versions, or is it more for the table provider crates (iceberg, deltalake, hudi, etc)?

Of course, in the short term it could be prototyped in dft and contributed back to those repos, but I'm asking more about the target state and where the appropriate home would be.

alamb commented 2 weeks ago

One thing that's not clear to me with the FFI approach is who the intended owner of the bindings is: should it be dft, since I don't want to worry about my deps being on different datafusion versions, or is it more for the table provider crates (iceberg, deltalake, hudi, etc)?

Of course, in the short term it could be prototyped in dft and contributed back to those repos, but I'm asking more about the target state and where the appropriate home would be.

The version of DataFusion used in the bindings has to match the client program (dft in this case) so I don't think they can go in the delta/iceberg crates

One thing we might be able to do is have a separate crate like datafusion-delta-table-provider that has different feature flags for different DataFusion versions 🤔 -- but I am now more or less wildly speculating

jonathanc-n commented 1 week ago

@matthewmturner Do you have anything in mind moving forward for integrating the rest of the data lakes? (such as a list of what needs to be done moving forward)

matthewmturner commented 1 week ago

@matthewmturner Do you have anything in mind moving forward for integrating the rest of the data lakes? (such as a list of what needs to be done moving forward)

Yes, now that DataFusion v43 has been released I am hoping that the rust implementations of the three main data lake formats (Deltalake / Iceberg / Hudi) update to that version. Then I will:

Update the existing Deltalake integration
Refresh the current Hudi PR
Add Iceberg integration which should be pretty easy now that it implements TableProviderFactory

I am interested in the FFI bindings but I don't anticipate working on that prior to the current release I am planning.

alamb commented 1 week ago

I have had a few days to reflect, and I personally think making it easy to integrate DataFusion into the "open data lake" stack might be my top priority over the coming months

@julienledem wrote up a very nice piece describing this: The advent of the Open Data Lake

In my mind, the specific work this entails is stuff like

Making it easier to use iceberg/delta/hudi with DataFusion
Document different tokio runtimes
Make parquet reader in arrow-rs faster/better on remote object stores

More to come

matthewmturner commented 1 week ago

@alamb can you expand on the different runtimes point? Are you referring to having a dedicated tokio runtime for CPU bound work? I actually have a ticket open for that - if that is a very important item for you I can add it to the list to do before releasing.

alamb commented 1 week ago

@alamb can you expand on the different runtimes point? Are you referring to having a dedicated tokio runtime for CPU bound work? I actually have a ticket open for that - if that is a very important item for you I can add it to the list to do before releasing.

This is what I had in mind:

https://github.com/apache/datafusion/issues/12393

Thanks for the link to the dft one. That is a good one
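
For anyone following along, the general idea of keeping CPU bound work off the threads that service IO can be sketched as below. This is only an analogy: it uses a single standard-library thread and channels rather than a second tokio runtime, and `CpuPool` and its methods are hypothetical names, not anything from DataFusion or dft.

```rust
use std::sync::mpsc;
use std::thread;

/// A boxed unit of CPU bound work.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical stand-in for a dedicated CPU runtime:
/// one worker thread fed through a channel.
pub struct CpuPool {
    sender: Option<mpsc::Sender<Job>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl CpuPool {
    pub fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let worker = thread::Builder::new()
            .name("cpu-worker".to_string())
            .spawn(move || {
                // Run jobs until every sender has been dropped.
                for job in receiver {
                    job();
                }
            })
            .expect("failed to spawn worker thread");
        Self {
            sender: Some(sender),
            worker: Some(worker),
        }
    }

    /// Run `f` on the dedicated thread and return a receiver for its result,
    /// so the calling (IO) thread is free to do other things in the meantime.
    pub fn spawn<T, F>(&self, f: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore the error if the caller dropped the receiver.
            let _ = tx.send(f());
        });
        self.sender
            .as_ref()
            .expect("pool is shut down")
            .send(job)
            .expect("worker thread exited");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Closing the channel ends the worker loop; then wait for the thread.
        drop(self.sender.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

A caller would do something like `let rx = pool.spawn(|| expensive_computation());` and later `rx.recv()` the result; with tokio, the equivalent would be spawning onto a separate runtime instead of a hand-rolled thread.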

matthewmturner commented 1 week ago

@alamb I will work on that next and will ping you when it is ready for review.

alamb commented 4 days ago

More to come

I filed

https://github.com/apache/datafusion/issues/13456

to try and organize my thoughts here better
