apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

support more date inferences in the csv reader #9331

Open universalmind303 opened 4 months ago

universalmind303 commented 4 months ago

Is your feature request related to a problem or challenge?

given a csv

mm/dd/yyyy,mm-dd-yyyy,dd/mm/yyyy,dd-mm-yyyy,yyyy/mm/dd,yyyy-mm-dd
01/01/2012,01-01-2012,01/01/2012,01-01-2012,2012/01/01,2012-01-01
02/02/2013,02-02-2013,02/02/2013,02-02-2013,2013/02/02,2013-02-02
03/03/2014,03-03-2014,03/03/2014,03-03-2014,2014/03/03,2014-03-03
04/04/2015,04-04-2015,04/04/2015,04-04-2015,2015/04/04,2015-04-04
05/05/2016,05-05-2016,05/05/2016,05-05-2016,2016/05/05,2016-05-05
06/06/2017,06-06-2017,06/06/2017,06-06-2017,2017/06/06,2017-06-06
07/07/2018,07-07-2018,07/07/2018,07-07-2018,2018/07/07,2018-07-07

currently only the last column will get converted to a Date32

❯ SELECT
  arrow_typeof("mm/dd/yyyy") AS "mm/dd/yyyy",
  arrow_typeof("mm-dd-yyyy") AS "mm-dd-yyyy",
  arrow_typeof("dd/mm/yyyy") AS "dd/mm/yyyy",
  arrow_typeof("dd-mm-yyyy") AS "dd-mm-yyyy",
  arrow_typeof("yyyy/mm/dd") AS "yyyy/mm/dd",
  arrow_typeof("yyyy-mm-dd") AS "yyyy-mm-dd"
from 'testdata/csv/dates.csv' limit 1;
+------------+------------+------------+------------+------------+------------+
| mm/dd/yyyy | mm-dd-yyyy | dd/mm/yyyy | dd-mm-yyyy | yyyy/mm/dd | yyyy-mm-dd |
+------------+------------+------------+------------+------------+------------+
| Utf8       | Utf8       | Utf8       | Utf8       | Utf8       | Date32     |
+------------+------------+------------+------------+------------+------------+

Describe the solution you'd like

it'd be nice to auto detect more of these formats. I thought at the very least YYYY/MM/DD or DD/MM/YYYY would be valid as they are both extremely common.

Describe alternatives you've considered

there really arent any good alternatives as there aren't any ergonomic ways to cast the utf8 field to date32 through the sql interface.

Additional context

No response

Lordworms commented 4 months ago

take

Lordworms commented 4 months ago

This has better to be done in arrow-rs, I would open a PR there

alamb commented 4 months ago

This has better to be done in arrow-rs, I would open a PR there

@Lordworms I suggest you (or someone) file a ticket in arrow-rs to track this work. I agree it would be best done there

Lordworms commented 4 months ago

filed https://github.com/apache/arrow-rs/pull/5444#issue-2160169292 to solve it. If that works I would add some test in datafusion and close this issue.