apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

[EPIC] A list of performance improvement tickets #5546

Open alamb opened 1 year ago

alamb commented 1 year ago

This has a list of performance improvements:

jaylmiller commented 1 year ago

I'd be interested in picking up one of these... is #846 currently being worked on? If not you could assign me, @alamb ? Otherwise, they all look pretty interesting to me so feel free to assign me to something else on the list

alamb commented 1 year ago

Thanks @jaylmiller !

I don't think https://github.com/apache/arrow-datafusion/issues/846 is being worked on, but given that the GroupByHash now uses the row format, I am not sure how relevant it is.

Please do feel free to comment on any ticket that is interesting -- no need to have it assigned to work on something!

Thanks for all the help so far on making Sort faster

jaylmiller commented 1 year ago

Sounds good! #846 was kind of arbitrary to be honest 😅, I'll read through them more closely and pick one that seems interesting.

alamb commented 1 year ago

Awesome -- thanks @jaylmiller

I think in general the "make aggregation faster" https://github.com/apache/arrow-datafusion/issues/4973 and high cardinality groups https://github.com/apache/arrow-datafusion/issues/5547 are the most pressing things from a performance perspective.

However, they are also the ones with the most active thought / work on them, so they probably need some more coordination, which you may or may not be interested in doing.