apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.86k stars 1.1k forks source link

feat: allow math functions to parse input from Utf8 #9302

Open caicancai opened 6 months ago

caicancai commented 6 months ago

Is your feature request related to a problem or challenge?

log10('-Infinity'); log2('-Infinity'); power('Infinity',2)

As above, using the + infinity function in some math functions will report an error in arrow-datafusion, but mysql and posgtres will not report an error. I think the + infinity parameter is meaningful in mathematics.

--postgres
postgres=# select power('Infinity', -2);
 power 
-------
     0
(1 row)postgres=# select log('Infinity', 2);
 log 
-----
   0
(1 row)postgres=# select log10('Infinity');
  log10   
----------
 Infinity
(1 row)postgres=# select power('Infinity', 2);
  power   
----------
 Infinity
(1 row)

--mysql
mysql> select log10('+Infinity');
+--------------------+
| log10('+Infinity') |
+--------------------+
| NULL |
+--------------------+
1 row in set, 2 warnings (0.00 sec)mysql> select power('Infinity', 2);
+----------------------+
| power('Infinity', 2) |
+----------------------+
| 0 |
+----------------------+
1 row in set, 1 warning (0.00 sec)mysql> select power('-Infinity', 2);
+-----------------------+
| power('-Infinity', 2) |
+-----------------------+
| 0 |
+-----------------------+
1 row in set, 1 warning (0.00 sec)mysql> select log10('-Infinity');
+--------------------+
| log10('-Infinity') |
+--------------------+
| NULL |
+--------------------+
1 row in set, 2 warnings (0.00 sec)

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

caicancai commented 6 months ago

@alamb Maybe I need some opinions from you. In my opinion, this is also a complicated issue.

Jefffrey commented 6 months ago

Datafusion does actually support this it seems:

❯ select log(arrow_cast('+Infinity', 'Float32'));
+------------------------+
| log(Utf8("+Infinity")) |
+------------------------+
| inf                    |
+------------------------+
1 row in set. Query took 0.002 seconds.

❯ select log(arrow_cast('-Infinity', 'Float32'));
+------------------------+
| log(Utf8("-Infinity")) |
+------------------------+
| NaN                    |
+------------------------+
1 row in set. Query took 0.002 seconds.

❯

Could you elaborate on which math functions are throwing error in cases of infinity input?

caicancai commented 6 months ago

Datafusion does actually support this it seems:

❯ select log(arrow_cast('+Infinity', 'Float32'));
+------------------------+
| log(Utf8("+Infinity")) |
+------------------------+
| inf                    |
+------------------------+
1 row in set. Query took 0.002 seconds.

❯ select log(arrow_cast('-Infinity', 'Float32'));
+------------------------+
| log(Utf8("-Infinity")) |
+------------------------+
| NaN                    |
+------------------------+
1 row in set. Query took 0.002 seconds.

❯

Could you elaborate on which math functions are throwing error in cases of infinity input?

Thank you for your reply. It seems that you need to do some cast conversion. Can't you log2('-Infinity') directly?

Jefffrey commented 6 months ago

Yeah that would seem reasonable. I guess it would involve changing the signatures of the math functions to also accept Utf8 data type, then try parse into float (or something like that)

Jefffrey commented 6 months ago

I've updated the issue title to better reflect this :+1:

(Since less of an issue with infinity, and more just a general data type issue with math functions)

caicancai commented 6 months ago

I've updated the issue title to better reflect this 👍

(Since less of an issue with infinity, and more just a general data type issue with math functions)

Very happy to see the way arrow-datafusion handles this, it's a way I would like to see, I think doing a cast is a very rigorous behavior

caicancai commented 6 months ago

@Jefffrey But I think this cast should be an implicit conversion. If not, maybe we should indicate this in the tests and documentation.

Jefffrey commented 6 months ago

@Jefffrey But I think this cast should be an implicit conversion. If not, maybe we should indicate this in the tests and documentation.

Yes, I think this is what I was trying to convey, that we should allow log('100.0') to be implicitly converted to log(100.0) since we can already cast from Utf8 to Float. I just might be wrong on the exact implementation details as I'm not as familiar with the internals of the function related code :slightly_smiling_face:

caicancai commented 6 months ago

@Jefffrey But I think this cast should be an implicit conversion. If not, maybe we should indicate this in the tests and documentation.

Yes, I think this is what I was trying to convey, that we should allow log('100.0') to be implicitly converted to log(100.0) since we can already cast from Utf8 to Float. I just might be wrong on the exact implementation details as I'm not as familiar with the internals of the function related code 🙂

Maybe I want to try it, but I'm not sure I can pull it off

caicancai commented 6 months ago

@Jefffrey I will try to add some tests first and then find the problem. At present, the math function should have some negative tests that need to be improved.

alamb commented 6 months ago

@Jefffrey But I think this cast should be an implicit conversion. If not, maybe we should indicate this in the tests and documentation.

Yes, I think this is what I was trying to convey, that we should allow log('100.0') to be implicitly converted to log(100.0) since we can already cast from Utf8 to Float. I just might be wrong on the exact implementation details as I'm not as familiar with the internals of the function related code 🙂

I think this process is called "coercion" in the datafusion codebase.

So we would need to add a coercion from DataType::Utf8 to float perhaps

caicancai commented 6 months ago

I think this process is called "coercion" in the datafusion codebase.

So we would need to add a coercion from DataType::Utf8 to float perhaps

double or float?

alamb commented 6 months ago

So we would need to add a coercion from DataType::Utf8 to float perhaps

double or float?

I think we would probably need the coercion for both Float32 and Float64

caicancai commented 6 months ago

@alamb I have some ideas about type conversion. I am trying out my ideas on calcite. If it makes sense, I will transplant it to arrow-datafusion. This may take some time.

alamb commented 6 months ago

@alamb I have some ideas about type conversion. I am trying out my ideas on calcite. If it makes sense, I will transplant it to arrow-datafusion. This may take some time.

Thanks @caicancai -- BTW I find following the example of existing systems is helpful in cases like this (there is no need to reinvent different coercion rules)

For example, postgres appears to happily coerce UTF8 -> float:

postgres=# select sqrt('1.4');
        sqrt
--------------------
 1.1832159566199232
(1 row)