apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.29k stars 973 forks source link

rewrite nvl function #9413

Closed guojidan closed 2 weeks ago

guojidan commented 3 months ago

Which issue does this PR close?

Closes #9365 .

Rationale for this change

now, nvl2 function is supported, so we don't need keep nvl code, we can rewrite nvl to nvl2

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jonahgao commented 3 months ago

I think that this manner of rewriting might have several downsides that need to be considered.

1) Because NVL no longer exists in the function list, some of its features will be missing. For example, the feature in issue #9392, and the hint after entering wrong parameters.

On this PR's branch:

DataFusion CLI v36.0.0
❯ select nvl(1);
Error during planning: Invalid function 'nvl'.
Did you mean 'nanvl'?

The hint on the main branch would be better.

❯ select nvl(1);
Error during planning: No function matches the given name and argument types 'nvl(Int64)'. You might need to add explicit type casts.
    Candidate functions:
    nvl(Boolean/UInt8/UInt16/UInt32/UInt64/Int8/Int16/Int32/Int64/Float32/Float64/Utf8/LargeUtf8, Boolean/UInt8/UInt16/UInt32/UInt64/Int8/Int16/Int32/Int64/Float32/Float64/Utf8/LargeUtf8)
  1. There may be incorrect result when NVL's parameters contain a volatile expression. For example, given an expression expr is nullif(random()>0.5, true). The result of select nvl(expr, false)should always be false. But if it is rewritten as nvl(expr, expr, false), it could potentially yield a NULL value, when the random result of the first expr is less than 0.5, and the random result of the second expr is greater than 0.5.

I have performed such a test on this branch.

DataFusion CLI v36.0.0
❯  select nvl(nullif(random()>0.5, true), false) is null from (unnest(range(1,100)));
+--------------------------------------------------------------------------------------------------------------------------+
| nvl2(nullif(random() > Float64(0.5),Boolean(true)),nullif(random() > Float64(0.5),Boolean(true)),Boolean(false)) IS NULL |
+--------------------------------------------------------------------------------------------------------------------------+
| true                                                                                                                     |
| false                                                                                                                    |
| false                                                                                                                    |
| false                                                                                                                    |
| false                                                                                                                    |

There is approximately a 25% chance of an incorrect result.

❯ select count(*) from (
   select
      nvl(nullif(random()>0.5, true), false)
   from (unnest(range(1,10000)))
) as t(a)
where a is null;
+----------+
| COUNT(*) |
+----------+
| 2517     |
+----------+
guojidan commented 2 months ago

I think that this manner of rewriting might have several downsides that need to be considered.

  1. Because NVL no longer exists in the function list, some of its features will be missing. For example, the feature in issue Add ScalarUDFs in missing function hints #9392, and the hint after entering wrong parameters.

On this PR's branch:

DataFusion CLI v36.0.0
❯ select nvl(1);
Error during planning: Invalid function 'nvl'.
Did you mean 'nanvl'?

The hint on the main branch would be better.

❯ select nvl(1);
Error during planning: No function matches the given name and argument types 'nvl(Int64)'. You might need to add explicit type casts.
  Candidate functions:
  nvl(Boolean/UInt8/UInt16/UInt32/UInt64/Int8/Int16/Int32/Int64/Float32/Float64/Utf8/LargeUtf8, Boolean/UInt8/UInt16/UInt32/UInt64/Int8/Int16/Int32/Int64/Float32/Float64/Utf8/LargeUtf8)
  1. There may be incorrect result when NVL's parameters contain a volatile expression. For example, given an expression expr is nullif(random()>0.5, true). The result of select nvl(expr, false)should always be false. But if it is rewritten as nvl(expr, expr, false), it could potentially yield a NULL value, when the random result of the first expr is less than 0.5, and the random result of the second expr is greater than 0.5.

I have performed such a test on this branch.

DataFusion CLI v36.0.0
❯  select nvl(nullif(random()>0.5, true), false) is null from (unnest(range(1,100)));
+--------------------------------------------------------------------------------------------------------------------------+
| nvl2(nullif(random() > Float64(0.5),Boolean(true)),nullif(random() > Float64(0.5),Boolean(true)),Boolean(false)) IS NULL |
+--------------------------------------------------------------------------------------------------------------------------+
| true                                                                                                                     |
| false                                                                                                                    |
| false                                                                                                                    |
| false                                                                                                                    |
| false                                                                                                                    |

There is approximately a 25% chance of an incorrect result.

❯ select count(*) from (
   select
      nvl(nullif(random()>0.5, true), false)
   from (unnest(range(1,10000)))
) as t(a)
where a is null;
+----------+
| COUNT(*) |
+----------+
| 2517     |
+----------+

yes, need to redesign rewrite manner

alamb commented 2 months ago

Marking this PR as draft as I don't think it is waiting for more review (and I am trying to get through the PR review backlog).

Do we have any alternate approaches? Or is the plan to leave both nvl and nvl2? Or something else?

github-actions[bot] commented 3 weeks ago

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.