apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.21k stars 957 forks source link

Only recompute schema in `TypeCoercion` when necessary #10369

Open alamb opened 2 weeks ago

alamb commented 2 weeks ago

Draft as it builds on https://github.com/apache/datafusion/pull/10356

Which issue does this PR close?

closes https://github.com/apache/datafusion/issues/10365

Rationale for this change

recomputing the Schema can be expensive, so only do it when necessary

What changes are included in this PR?

  1. Update TypeCoercionRewriter to track when the expression is actually rewritten (via Transformed::yes)
  2. Only recompute the schema when an expression was rewritten

This is a big change change as it needs to is properly track when expressions were changed which required updating the entire expression rewrite

Are these changes tested?

Existing CI

Are there any user-facing changes?

No functional changes

TODO performance tests

alamb commented 2 weeks ago

There are a few tests that don't pass for reason's I haven't debugged yet but it looks like recomputing schemas may be masking issues elsewhere. Once https://github.com/apache/datafusion/pull/10356 is merged I'll do another profiling run to figure out if recomputing schema shows up in the traces and will keep hacking on this if so

alamb commented 2 days ago

Update here is I don't think I'll have a chance to work on this in the near term unless recomputing the schema shows up in planning benchmarks again.

I think given the level of effort involved, this approach is not likely to yield a large speedup (though I do think this PR's approach would be an improvement over the current code)