apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Aggregation fuzz testing #12114

Open alamb opened 3 weeks ago

alamb commented 3 weeks ago

Is your feature request related to a problem or challenge?

While reviewing https://github.com/apache/datafusion/pull/11943 from @Rachelint it is becoming clear to me that the hash aggregate code is now pretty sophisticated and I am not sure our testing has kept up. In fact I couldn't come up with a great way to systematically test the new code added in https://github.com/apache/datafusion/pull/11943

Also, the code in https://github.com/apache/datafusion/pull/11627 from @korowa for skipping partial aggregates has a similar problem, as it is not invoked. There is also code for streaming and partial streaming group by.

All this code has unit tests, but I am not confident that all the combinations are checked. For example, the code paths are affected by:

  1. Sort order of the input
  2. Partitioning of the input
  3. The type of the group keys
  4. The number of groups
  5. The number of rows in each group
  6. The type of the aggregate
  7. The number of aggregates
  8. Whether the aggregate supports group aggregation
  9. Whether the groups aggregator supports partial aggregation skipping

Describe the solution you'd like

I would like a more systematic way to test this code, both to ensure our current code is correct and to ensure that future changes do not introduce subtle, hard-to-debug regressions or wrong results.

Describe alternatives you've considered

What I think would be good is a test framework that:

  1. Describes an input data set (e.g. RecordBatches)
  2. Runs the same query on the same input data set with different configurations (e.g. block size, input sort order, distribution of input blocks, etc.)
  3. Compares the results and ensures they are the same in all cases
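
The three steps above can be sketched roughly as follows. This is a std-only toy rather than DataFusion code: `Row`, `group_sum`, `split_into_blocks` and `results_agree` are hypothetical names, the "query" is a grouped sum, and the only configuration varied is the block size.

```rust
use std::collections::BTreeMap;

/// One input row: (group key, value). A stand-in for a row of a RecordBatch.
type Row = (u32, i64);

/// The "query" under test: SUM(value) grouped by key, consumed block by block
/// in the order a streaming operator would see its input.
fn group_sum(blocks: &[Vec<Row>]) -> BTreeMap<u32, i64> {
    let mut acc: BTreeMap<u32, i64> = BTreeMap::new();
    for block in blocks {
        for &(key, value) in block {
            *acc.entry(key).or_insert(0) += value;
        }
    }
    acc
}

/// Split the same rows into blocks of `block_size` rows (the last block may be shorter).
/// `block_size` must be non-zero, otherwise `chunks` panics.
fn split_into_blocks(rows: &[Row], block_size: usize) -> Vec<Vec<Row>> {
    rows.chunks(block_size).map(|chunk| chunk.to_vec()).collect()
}

/// Run the query once per block size and check that every run matches
/// the result computed from a single block holding all rows.
fn results_agree(rows: &[Row], block_sizes: &[usize]) -> bool {
    let expected = group_sum(&[rows.to_vec()]);
    block_sizes
        .iter()
        .all(|&size| group_sum(&split_into_blocks(rows, size)) == expected)
}
```

A real harness would presumably swap `group_sum` for an actual query execution and compare the collected batches instead of a map.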

Parameters to randomly vary for each input:

  1. Sort order of the input
  2. Target block size
  3. Number of input partitions
  4. Memory limit (to force spilling)
  5. Shuffled input row distribution across blocks
  6. Whether skipping partial aggregation is enabled
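
One possible way to draw these parameters at random is sketched below. It is only an illustration under stated assumptions: `FuzzConfig`, `random_config` and the value ranges are hypothetical and do not correspond to existing DataFusion settings, and a small seeded xorshift generator stands in for a real random number crate so that a failing case can be replayed from its seed.

```rust
/// Tiny deterministic PRNG (xorshift64) so a failing case can be replayed from its seed.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Value in `lo..=hi` (slightly biased, which is fine for test generation).
    fn range(&mut self, lo: u64, hi: u64) -> u64 {
        lo + self.next_u64() % (hi - lo + 1)
    }

    fn coin(&mut self) -> bool {
        self.next_u64() & 1 == 1
    }
}

/// Hypothetical bundle of the six knobs listed above.
#[derive(Debug, Clone, PartialEq)]
struct FuzzConfig {
    sorted_input: bool,
    target_block_size: usize,
    input_partitions: usize,
    /// `None` means no limit; a small limit is meant to force spilling.
    memory_limit_bytes: Option<usize>,
    shuffle_rows_across_blocks: bool,
    skip_partial_aggregation: bool,
}

/// Draw one configuration; the ranges are arbitrary assumptions.
fn random_config(rng: &mut XorShift64) -> FuzzConfig {
    FuzzConfig {
        sorted_input: rng.coin(),
        target_block_size: rng.range(1, 8192) as usize,
        input_partitions: rng.range(1, 16) as usize,
        memory_limit_bytes: if rng.coin() {
            Some(rng.range(1 << 10, 1 << 20) as usize)
        } else {
            None
        },
        shuffle_rows_across_blocks: rng.coin(),
        skip_partial_aggregation: rng.coin(),
    }
}
```

Logging the seed alongside each generated `FuzzConfig` would make a wrong-result report reproducible.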

Test cases:

  1. Types of the group keys
  2. Single/multiple column groups
  3. Number of groups (low/high cardinality)
  4. Different aggregates
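
The test-case dimensions above could be enumerated as a cross product, roughly as in the sketch below. The enums and the particular key types and aggregates are illustrative assumptions, not a list taken from DataFusion.

```rust
/// Hypothetical group key types to cover.
#[derive(Debug, Clone, Copy, PartialEq)]
enum KeyType {
    Int64,
    Utf8,
}

/// Low vs. high number of distinct groups.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Cardinality {
    Low,
    High,
}

/// Hypothetical aggregates to cover.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Aggregate {
    Count,
    Sum,
    Min,
    Max,
}

#[derive(Debug, Clone, PartialEq)]
struct TestCase {
    /// One entry per group column, so a length above 1 means a multi-column group.
    key_types: Vec<KeyType>,
    cardinality: Cardinality,
    aggregate: Aggregate,
}

/// Cross product of key shapes x cardinality x aggregate.
fn all_test_cases() -> Vec<TestCase> {
    let key_shapes = vec![
        vec![KeyType::Int64],
        vec![KeyType::Utf8],
        vec![KeyType::Int64, KeyType::Utf8],
    ];
    let mut cases = Vec::new();
    for keys in &key_shapes {
        for &cardinality in &[Cardinality::Low, Cardinality::High] {
            for &aggregate in &[
                Aggregate::Count,
                Aggregate::Sum,
                Aggregate::Min,
                Aggregate::Max,
            ] {
                cases.push(TestCase {
                    key_types: keys.clone(),
                    cardinality,
                    aggregate,
                });
            }
        }
    }
    cases
}
```

Each generated case would then be run under several random configurations, so the number of cases multiplies quickly as dimensions are added.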

Additional context

We also have some great SQL fuzz coverage in https://github.com/datafusion-contrib/datafusion-sqlancer from @2010YOUY01, but I think that focuses on the queries themselves, rather than the setup (block size, input order, etc.)

Existing aggregate coverage in datafusion core fuzz test (cargo test --test fuzz

2010YOUY01 commented 3 weeks ago

> Additional context
>
> We also have some great SQL fuzz coverage in https://github.com/datafusion-contrib/datafusion-sqlancer from @2010YOUY01, but I think that focuses on the queries themselves, rather than the setup (block size, input order, etc.)

I agree SQLancer is not the best choice for aggregation-specific fuzzing (though doable), due to:

  1. It takes a lot of effort to try all possible configuration knobs on randomly generated data
  2. It is random SQL + random config: the randomly generated SQL will be complex, with deeply nested exprs, which will be hard to reduce and investigate

So for now I plan to cover more SQL features and try to find bugs that are easy to identify and fix; configuration fuzzing is a lower priority for SQLancer.

So I think Rust-level fuzzing is better.

Besides, I think we can also find some comprehensive aggregation queries and do some SQL-level fuzzing (fixed SQL + random config, checking that the query always gives the same result under different configs).

2010YOUY01 commented 3 weeks ago

I am also curious what the compatibility matrix is for all the aggregation optimizations (for example, can skip-partial-aggregation and external-aggregation be triggered in the same execution, and likewise for all other combinations). Specifying this in the configuration manual and code docs would make it easier to understand the aggregation details and to write more effective tests.

Rachelint commented 3 weeks ago

> I am also curious what the compatibility matrix is for all the aggregation optimizations (for example, can skip-partial-aggregation and external-aggregation be triggered in the same execution, and likewise for all other combinations). Specifying this in the configuration manual and code docs would make it easier to understand the aggregation details and to write more effective tests.

To my knowledge, it may be:

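|                   | spilling | streaming(sorted) | skip partial | blocked emission |
| ----------------- | -------- | ----------------- | ------------ | ---------------- |
| spilling          |          | x                 | o            | x                |
| streaming(sorted) | x        |                   | o            | x                |
| skip partial      | o        | x                 |              | o                |
| blocked emission  | x        | x                 | o            |                  |

Rachelint commented 3 weeks ago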

Could we first run the basic aggregation without any optimizations enabled and use its output as the expected result, then modify the options to enable different optimizations and their combinations, and compare their results with the expected one?

alamb commented 2 weeks ago

> Could we first run the basic aggregation without any optimizations enabled and use its output as the expected result, then modify the options to enable different optimizations and their combinations, and compare their results with the expected one?

Yes, I think that is likely a good plan. In my mind, as long as all the code paths get the same answer, that will increase our confidence that the system is computing the correct results in the different places.
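
A minimal sketch of that plan, assuming a grouped sum as the query: a plain single-pass aggregation provides the expected output, and two stand-in "optimized" paths (a two-phase partial/final aggregation and a sort-based streaming aggregation) are checked against it. All function names are hypothetical, and these toy variants only imitate the shape of the real code paths.

```rust
use std::collections::HashMap;

/// (group key, value)
type Row = (u32, i64);

/// Baseline: single-pass hash aggregation with no optimizations.
fn baseline_sum(rows: &[Row]) -> HashMap<u32, i64> {
    let mut groups: HashMap<u32, i64> = HashMap::new();
    for &(key, value) in rows {
        *groups.entry(key).or_insert(0) += value;
    }
    groups
}

/// Variant 1: partial aggregation per partition, then a final merge.
/// `partitions` must be non-zero.
fn two_phase_sum(rows: &[Row], partitions: usize) -> HashMap<u32, i64> {
    let mut partials: Vec<HashMap<u32, i64>> = vec![HashMap::new(); partitions];
    for (i, &(key, value)) in rows.iter().enumerate() {
        // Round-robin rows across partitions.
        *partials[i % partitions].entry(key).or_insert(0) += value;
    }
    let mut merged: HashMap<u32, i64> = HashMap::new();
    for partial in partials {
        for (key, sum) in partial {
            *merged.entry(key).or_insert(0) += sum;
        }
    }
    merged
}

/// Variant 2: sort by key, then emit each group as soon as the key changes.
fn streaming_sum(rows: &[Row]) -> HashMap<u32, i64> {
    let mut sorted = rows.to_vec();
    sorted.sort_by_key(|&(key, _)| key);
    let mut out: HashMap<u32, i64> = HashMap::new();
    let mut current: Option<(u32, i64)> = None;
    for (key, value) in sorted {
        current = match current {
            Some((cur_key, sum)) if cur_key == key => Some((cur_key, sum + value)),
            Some((cur_key, sum)) => {
                out.insert(cur_key, sum);
                Some((key, value))
            }
            None => Some((key, value)),
        };
    }
    if let Some((cur_key, sum)) = current {
        out.insert(cur_key, sum);
    }
    out
}

/// Name of the first variant whose output differs from the baseline, if any.
fn first_mismatch(rows: &[Row]) -> Option<&'static str> {
    let expected = baseline_sum(rows);
    let variants: Vec<(&'static str, HashMap<u32, i64>)> = vec![
        ("two_phase(3 partitions)", two_phase_sum(rows, 3)),
        ("streaming(sorted)", streaming_sum(rows)),
    ];
    variants
        .into_iter()
        .find(|(_, got)| *got != expected)
        .map(|(name, _)| name)
}
```

Returning the name of the disagreeing variant, rather than a bare boolean, points directly at the code path to investigate.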

Rachelint commented 2 weeks ago

> > Could we first run the basic aggregation without any optimizations enabled and use its output as the expected result, then modify the options to enable different optimizations and their combinations, and compare their results with the expected one?
>
> Yes, I think that is likely a good plan. In my mind, as long as all the code paths get the same answer, that will increase our confidence that the system is computing the correct results in the different places.

OK, maybe we could just start by making a simple sketch, and try to implement the current aggregation fuzz tests based on it?

I can give it a try, and help push forward enabling #11943 by default.

alamb commented 2 weeks ago

> > > Could we first run the basic aggregation without any optimizations enabled and use its output as the expected result, then modify the options to enable different optimizations and their combinations, and compare their results with the expected one?
> >
> > Yes, I think that is likely a good plan. In my mind, as long as all the code paths get the same answer, that will increase our confidence that the system is computing the correct results in the different places.
>
> OK, maybe we could just start by making a simple sketch, and try to implement the current aggregation fuzz tests based on it?
>
> I can give it a try, and help push forward enabling #11943 by default.

Thank you -- that would be awesome. I can't keep up anymore with everything that is going on

In terms of helping along DataFusion performance, my plan was to focus first on getting StringView enabled and then switch more to focusing on the blocked intermediate state.

I will, however, prioritize time for reviewing aggregation testing, as I think testing in general is really important for DataFusion.

Rachelint commented 2 weeks ago

take