apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Is CSV reader's TimestampParser usable elsewhere? #31341

Open asfimport opened 2 years ago

asfimport commented 2 years ago

The TimestampParser seems to be able to cycle through several formats. This sort of functionality would be very useful for some of the lubridate bindings that need to behave in a similar way.


library(arrow)
library(fs)
library(readr)
library(tibble)

tf <- fs::file_temp(ext = "csv")
fs::file_create(tf)

sample_times <- tibble(a = c("09/13/2013", "25/12/1998", "09-13-13", "23_Feb_2022", "09/13/2018"))
write_csv(sample_times, tf)

read_csv_arrow(tf, 
               as_data_frame = TRUE,
               timestamp_parsers = c("%m/%d/%Y", "%d/%m/%Y", "%m-%d-%y", "%d_%b_%Y"))
#> # A tibble: 5 × 1
#>   a                  
#>   <dttm>             
#> 1 2013-09-13 01:00:00
#> 2 1998-12-25 00:00:00
#> 3 2013-09-13 01:00:00
#> 4 2022-02-23 00:00:00
#> 5 2018-09-13 01:00:00

For example, in lubridate, ymd() cycles through all possible formats that have year, month, and day components in that order (e.g. "%Y-%m-%d", "%y-%m-%d", "%Y-%b-%d", "%y-%b-%d", "%Y-%B-%d", "%y-%B-%d", etc).
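
To make the idea of "all formats with the components in the right order" concrete, here is a minimal C++ sketch that enumerates such candidates. The helper name `YmdCandidateFormats` and the choice of tokens are assumptions for illustration only; this is not lubridate's actual algorithm, which handles more separators and variants.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow or lubridate): build strptime-style
// formats whose components appear in year, month, day order.
std::vector<std::string> YmdCandidateFormats(const std::string& sep = "-") {
  const std::vector<std::string> years = {"%Y", "%y"};         // 4-digit, 2-digit
  const std::vector<std::string> months = {"%m", "%b", "%B"};  // number, abbreviated, full name
  std::vector<std::string> formats;
  for (const auto& month : months) {
    for (const auto& year : years) {
      formats.push_back(year + sep + month + sep + "%d");
    }
  }
  return formats;
}
```

With the default separator this yields six formats, starting with "%Y-%m-%d" and ending with "%y-%B-%d".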

I guess my question is: can we factor this CSV reader feature out so that it is usable elsewhere? This was the bit that caught my attention: "using the virtual parser interface in arrow/util/value_parsing.h", which suggested that using it elsewhere might be a possibility.

Reporter: Dragoș Moldovan-Grünfeld / @dragosmg


Note: This issue was originally created as ARROW-15912. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Well, the timestamp parser functionality is available in arrow/util/value_parsing.h. You'll have to reimplement the logic to loop through the parsers yourself, but that should be close to trivial.

Does this issue need to be kept open?
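
The "loop through parsers" logic mentioned above can be sketched roughly as follows. To keep the example self-contained it uses POSIX `strptime` and `timegm` as stand-ins rather than Arrow's `TimestampParser` objects, so it only illustrates the first-match-wins loop; the helper name `ParseWithFormats` is hypothetical, and the code assumes C++17 on a POSIX system.

```cpp
#include <time.h>  // strptime, timegm (POSIX/glibc extensions, not ISO C++)

#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: try each strptime format in order and return the first
// successful parse as seconds since the Unix epoch, interpreted as UTC.
std::optional<int64_t> ParseWithFormats(const std::string& value,
                                        const std::vector<std::string>& formats) {
  for (const std::string& format : formats) {
    struct tm parsed = {};
    const char* end = strptime(value.c_str(), format.c_str(), &parsed);
    // Accept only if the format matched and consumed the whole input.
    if (end != nullptr && *end == '\0') {
      return static_cast<int64_t>(timegm(&parsed));
    }
  }
  return std::nullopt;  // no format matched
}
```

As in the CSV reader example from the original report, the order of the formats matters: "09/13/2013" is accepted by "%m/%d/%Y", while "25/12/1998" fails that format (there is no month 25) and falls through to "%d/%m/%Y". With Arrow itself, the same loop would presumably iterate over a vector of parsers obtained from arrow/util/value_parsing.h instead of raw format strings.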