apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.04k stars 1.14k forks source link

Decide on CAST behaviour for invalid inputs #120

Open alamb opened 3 years ago

alamb commented 3 years ago

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10793

This is a placeholder for now. See discussion on [https://github.com/apache/arrow/pull/8794]

Briefly, the issue is do we want CAST to return null for invalid inputs or throw an error. Spark has different behavior depending on whether ANSI mode is enabled or not.

I'm not sure if this is a DataFusion specific or a more general Arrow issue yet. It needs a discussion.

 

 

alamb commented 3 years ago

Comment from Mike Seddon(MikeSeddonAU) @ 2020-12-02T22:44:31.207+0000:

I strongly feel that DataFusion should adopt the ANSI-style strict typing rather than silent error suppression and return of NULL values as intuitively (due to years of using DBMS) users expect that if an error was not thrown then all operations were completed successfully.

The default Spark behavior was inherited from Hive SQL which I assume was originally built to support a business where absolute precision was not necessarily important. As part of the Spark 3.0 release a huge amount of effort was put in to comply with ANSI standard SQL ([https://spark.apache.org/docs/3.0.0/sql-ref-ansi-compliance.html|https://spark.apache.org/docs/3.0.0/sql-ref-ansi-compliance.html)]) which is obviously a lot harder to retrofit than start with.

This also goes wider than type conversions as adopting ANSI SQL standards (including functionality like [https://github.com/apache/arrow/pull/8688] which I think requires a CASE statement in ANSI SQL) should maybe be agreed by the PMC to give a framework for assessing PRs against. Perhaps this ticket should be changed to a discussion of which dialect of SQL DataFusion aims to support.