apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Materialize Dictionaries in Group Keys #7647

Open tustvold opened 9 months ago

tustvold commented 9 months ago

Is your feature request related to a problem or challenge?

Currently, grouping on a dictionary column will return dictionary-encoded group keys. Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to yield significant returns. Additionally, following https://github.com/apache/arrow-datafusion/pull/7587, computing the dictionary is a non-trivial operation that could be eliminated.

Describe the solution you'd like

When grouping on a dictionary column, e.g. Dictionary(DataType::Int32, DataType::Utf8), the returned schema should be the underlying value type, i.e. DataType::Utf8.
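
As an illustrative sketch of the proposed type mapping (plain Python modeling types as strings, not DataFusion code), the output schema rule would unwrap a dictionary group key to its value type:

```python
# Hypothetical sketch (not DataFusion code): the proposal maps a dictionary
# group-key type to its underlying value type in the aggregate output schema.

def group_key_output_type(data_type: str) -> str:
    """Return the proposed output schema type for a group key.

    Types are modeled as plain strings for illustration,
    e.g. 'Dictionary(Int32, Utf8)' or 'Utf8'.
    """
    prefix = "Dictionary("
    if data_type.startswith(prefix) and data_type.endswith(")"):
        # 'Dictionary(Int32, Utf8)' -> value type 'Utf8'
        inner = data_type[len(prefix):-1]
        _key_type, value_type = (part.strip() for part in inner.split(",", 1))
        return value_type
    # Non-dictionary group keys keep their type unchanged
    return data_type

print(group_key_output_type("Dictionary(Int32, Utf8)"))  # Utf8
print(group_key_output_type("Int64"))                    # Int64
```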

Describe alternatives you've considered

No response

Additional context

No response

qrilka commented 9 months ago

Hi @tustvold, I wanted to look into this ticket; I see the TODO you've left in #7587. Could you advise me on an easy way to simulate this situation, e.g. in datafusion-cli? Or maybe there is a similar test in the code base?

alamb commented 9 months ago

You can create dictionary-encoded columns using arrow_cast.

~Here is an example~

Here is a better example (use string dictionaries): https://github.com/apache/arrow-datafusion/blob/e23d34bae60bb2f9c496241e218bab795af3af83/datafusion/sqllogictest/test_files/topk.slt#L223-L232
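
For reference, a minimal datafusion-cli session along the lines of the linked test might look like the sketch below (illustrative; the table name and data are made up, and exact syntax support may vary by DataFusion version):

```sql
-- Create a dictionary-encoded string column using arrow_cast
CREATE TABLE t AS
SELECT arrow_cast(c, 'Dictionary(Int32, Utf8)') AS d
FROM (VALUES ('foo'), ('bar'), ('foo')) AS v(c);

-- Grouping on the dictionary column exercises the code path in question
SELECT d, count(*) FROM t GROUP BY d;
```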

qrilka commented 8 months ago

@tustvold I've spent some time getting familiar with the code for aggregations and I have a couple of questions. Maybe you could help me here?

I see that for aggregates, the initial plan in the base case has 2 nested AggregateExecs: Partial and Final(Partitioned). Every AggregateExec uses RowConverter to convert arrays into rows for its input and back from rows into arrays for its output. I assume these conversions are basically inevitable, because RecordBatches are sent between execution stages while aggregation internally operates on rows, not arrays; is my understanding correct here?

It looks to me like it makes sense to convert from Dictionary to Utf8 in the first AggregateExec, but there doesn't seem to be a mode like AggregateInitial. What could be the proper way out here? Maybe a conversion step could be added before aggregation if any of the fields has a dictionary type? What would you advise?

qrilka commented 7 months ago

@alamb maybe you could help me here?

tustvold commented 7 months ago

You should not need to do any conversion; the conversion is already being done by RowConverter. Currently there is a cast annotated with a TODO linking back to this ticket. It should just be a case of removing this cast and updating the schema logic within the plan to account for this.

qrilka commented 7 months ago

@tustvold I saw the conversion in RowConverter, but as I understand it, it happens multiple times: we convert the dictionary into Utf8 in AggregatePartial, then cast the results back to dictionary, and then we do the same in AggregateFinal.

Please let me know if I get your idea right: in the case of AggregatePartial followed by AggregateFinal, the schema for the first one should have Utf8 as its output and the second aggregate will then use Utf8 both as its input and as output.

alamb commented 6 months ago

TLDR I recommend we revert this change and reopen this ticket while we reconsider how to handle this case better.

Background

This ticket caused a functional regression for us downstream in https://github.com/apache/arrow-datafusion/issues/8738 (a query that used to run started erroring).

The cause of the issue is that the LogicalPlan and ExecutionPlans schemas no longer match. I started exploring making them match in https://github.com/apache/arrow-datafusion/pull/8766

However, while working on that issue, I thought more about this change, and I am not sure the change described in this issue is good for multi-column groupings at all.

Rationale

Dictionary Encoding on single-column group keys is bad because there is no repetition in the data, and therefore the Dictionary encoding is pure overhead. For example, given

SELECT ... GROUP BY a                                                                

Dictionary Encoding is worse than native encoding:

┌──────┐                   ┌──────┐      ┌────┐     
│ foo  │                   │ foo  │      │ 0  │     
├──────┤                   ├──────┤      ├────┤     
│ bar  │                   │ bar  │      │ 1  │     
├──────┤                   ├──────┤      ├────┤     
│ baz  │  ────────▶        │ baz  │      │ 2  │     
├──────┤                   ├──────┤      ├────┤     
│  ff  │                   │  ff  │      │ 3  │     
├──────┤                   ├──────┤      ├────┤     
│  ..  │                   │  ..  │      │ .. │     
├──────┤                   ├──────┤      ├────┤     
│ aaz  │                   │ aaz  │      │9999│     
└──────┘                   └──────┘      └────┘     

 Group Values            values array   keys array  
   (distinct                                        
 values of a)                                       
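
The overhead above can be simulated with a small stdlib-only Python sketch (illustrative, not DataFusion's implementation): since single-column group keys are distinct by construction, dictionary encoding stores every value once anyway and the keys array is pure extra storage.

```python
# Illustrative sketch (not DataFusion code): dictionary-encode a list of
# values by assigning each distinct value an integer key.

def dictionary_encode(values):
    value_to_key, dictionary, keys = {}, [], []
    for v in values:
        if v not in value_to_key:
            # First occurrence: append to the values array
            value_to_key[v] = len(dictionary)
            dictionary.append(v)
        keys.append(value_to_key[v])
    return dictionary, keys

# Single-column group keys are all distinct, so nothing is deduplicated:
group_keys = ["foo", "bar", "baz", "ff"]
dictionary, keys = dictionary_encode(group_keys)
print(dictionary)  # ['foo', 'bar', 'baz', 'ff']  (no values eliminated)
print(keys)        # [0, 1, 2, 3]                 (pure overhead)
```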

However, the story is different with a multi-column group key: in that case, dictionary encoding each column individually can be a significant performance improvement, since each column may have substantial redundancy even though the combined keys are distinct. For example, given this query

SELECT ... GROUP BY a,b
┌──────┐  ┌──────┐               ┌──────┐      ┌────┐         ┌──────┐      ┌────┐      
│ foo  │  │ tag1 │               │ foo  │      │ 0  │         │ tag1 │      │ 0  │      
├──────┤  ├──────┤               ├──────┤      ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag2 │               │ baz  │      │ 0  │         │ tag2 │      │ 1  │      
├──────┤  ├──────┤               └──────┘      ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag3 │ ────────▶                   │ 0  │         │ tag3 │      │ 2  │      
├──────┤  ├──────┤                             ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag4 │                             │ 0  │         │ tag4 │      │ 3  │      
├──────┤  ├──────┤                             ├────┤         └──────┘      ├────┤      
│  ..  │  │  ..  │                             │ .. │                       │ .. │      
├──────┤  ├──────┤                             ├────┤                       ├────┤      
│ baz  │  │ tag4 │                             │ 1  │                       │ 3  │      
└──────┘  └──────┘                             └────┘                       └────┘      

       Group Values            values array   keys array    values array   keys array   
    (distinct values of             (a)           (a)            (b)           (b)      
           a,b)                                                                         

This could especially impact multi-phase grouping where dictionary encoding will save significant time hashing values for low cardinality string columns.
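
The multi-column case can be simulated with the same kind of stdlib-only Python sketch (illustrative, not DataFusion code): each column is encoded independently, so per-column repetition is deduplicated even though every (a, b) pair is distinct.

```python
# Illustrative sketch (not DataFusion code): with a multi-column group key,
# each column is dictionary-encoded on its own, so per-column redundancy
# makes the encoding pay off.

def dictionary_encode(values):
    value_to_key, dictionary, keys = {}, [], []
    for v in values:
        if v not in value_to_key:
            value_to_key[v] = len(dictionary)
            dictionary.append(v)
        keys.append(value_to_key[v])
    return dictionary, keys

# Distinct (a, b) group keys, but each column repeats values.
a = ["foo", "foo", "foo", "foo", "baz"]
b = ["tag1", "tag2", "tag3", "tag4", "tag4"]

dict_a, keys_a = dictionary_encode(a)
dict_b, keys_b = dictionary_encode(b)
print(dict_a, keys_a)  # ['foo', 'baz'] [0, 0, 0, 0, 1]
print(dict_b, keys_b)  # ['tag1', 'tag2', 'tag3', 'tag4'] [0, 1, 2, 3, 3]
```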

In fact, we think we may have seen a performance regression when picking up this change downstream as well, which could also be explained by the observation above.

Thus I recommend we revert this change via https://github.com/apache/arrow-datafusion/pull/8740 while we reconsider how to handle this case (maybe apply it only for single-column group by? Maybe do the dictionary encoding within the RowEncoder to avoid generating many redundant strings?).

tustvold commented 6 months ago

> This could especially impact multi-phase grouping where dictionary encoding will save significant time hashing values for low cardinality string columns.

If anything, this is precisely the case where the old behaviour was especially inefficient: it would compute the rows, expanding the dictionary; perform the grouping; convert the rows back to strings; convert the strings back to dictionaries; convert back to strings in the row format, again expanding the dictionaries; and repeat for each stage.

alamb commented 6 months ago

I want to be clear that I have no particular evidence one way or the other about the performance implications of this particular change (and I probably confused the issue with speculation in https://github.com/apache/arrow-datafusion/issues/7647#issuecomment-1879161073)

So what I would like to do is:

  1. Revert the change
  2. Reopen this ticket
  3. Work on getting a PR ready to put the change back in that 1) doesn't cause a functional regression and 2) comes with evidence that it improves performance

qrilka commented 5 months ago

@alamb what could serve as evidence for 3.2? Is there anything in the code base, or maybe in some other ticket? The plan you show makes total sense.

alamb commented 5 months ago

Thanks @qrilka

@alamb what could be some evidence for 3.2?

I think we need to create a benchmark that does aggregation queries on dictionary-encoded string columns (I know the existing end-to-end TPCH and ClickBench benchmarks do not do this). I will file a ticket shortly about this.

alamb commented 5 months ago

I filed https://github.com/apache/arrow-datafusion/issues/8791 to track adding additional test coverage, and I plan to work on that in the next few days