Open alamb opened 4 days ago
BTW we can verify that this is working as expected after merging
Then you can do
cd datafusion-cli
RUST_LOG=debug cargo run -- -c "create or replace table foo(x decimal(10,3), y int) as values (10.0, 100), (21.2, 200), (33.0, 300); select count(*) from foo group by x, y";
You should not see any lines about Creating GroupValuesRows
. Here is what is printed out on main
[2024-11-20T22:08:58Z DEBUG datafusion_physical_plan::aggregates::group_values::row] Creating GroupValuesRows for schema: Field { name: "x", data_type: Decimal128(10, 3), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "y", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
take
@alamb For this pr, will it need its own custom column implementation for decimal128 instead of instantiate_primitive!, similar to how byte, byteview, stringview, etc. are dealt with? I am thinking that due to the parameters
@alamb For this pr, will it need its own custom column implementation for decimal128 instead of instantiate_primitive!, similar to how byte, byteview, stringview, etc. are dealt with? I am thinking that due to the parameters
I think you can use Vec<u128>
for the underlying storage (aka use the PrimtiveGroup struct), for the underlying primitive type, but then call this function to:
To adjust the type at the end
For example like this: https://github.com/apache/datafusion/blob/207e855f235401b997269fa858af641cf2a7b81e/datafusion/physical-plan/src/aggregates/topk/heap.rs#L156
So the idea is that you keep the actual group values as native types (u128 in this case) as we are only comparing their values
Does that make sense?
Yep, thanks!
@alamb For this pr, will it need its own custom column implementation for decimal128 instead of instantiate_primitive!, similar to how byte, byteview, stringview, etc. are dealt with? I am thinking that due to the parameters
I think you can use
Vec<u128>
for the underlying storage (aka use the PrimtiveGroup struct), for the underlying primitive type, but then call this function to:To adjust the type at the end
For example like this:
So the idea is that you keep the actual group values as native types (u128 in this case) as we are only comparing their values
Does that make sense?
It seems adjust_output_array
clone the array, is it better to use with_data_type
to avoid the clone?
It seems
adjust_output_array
clone the array, is it better to usewith_data_type
to avoid the clone?
I think they are roughly equivalent - if we can avoid cloning the array that certainly seems better
Is your feature request related to a problem or challenge?
In https://github.com/apache/datafusion/pull/12269 @jayzhan211 made significant improvements to how group values are stored in multi-column aggregations.
Specifically for queries like
The improvement relies on implementing specialized versions of
GroupColumn
for the types ofcol1
,colN
We have implemented the primitive types and Strings/StringViews now, but we have not implemented all types
This means queries like
Will fall back to the slower (but general)
GroupValuesRows
: https://github.com/apache/datafusion/blob/a6586ccfffb569461e357817529ecf6647c6d62e/datafusion/physical-plan/src/aggregates/group_values/row.rs#L40-L41Describe the solution you'd like
Implement
GroupColumn
forDecimal128
types.You can see how to do this here:
https://github.com/apache/datafusion/blob/e4bd57918b1725f37f3f500c503c07b1c1bf90bf/datafusion/physical-plan/src/aggregates/group_values/mod.rs#L117-L121
@jonathanc-n also made a really nice PR here
and the make sure there are tests for each of those types in queries that group on multiple columns
Describe alternatives you've considered
No response
Additional context
Here is an example for how this was done for Strings: https://github.com/apache/datafusion/pull/12809