apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Docs are not clear on expected behaviour of date parsing functions (e.g. dmy()) on Windows vs. Linux/MacOS #39754

Open hamgamb opened 9 months ago

hamgamb commented 9 months ago
library(arrow)
#> Warning: package 'arrow' was built under R version 4.3.2
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:arrow':
#> 
#>     duration
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

arrow_table(date = "12JAN2004") %>%
  mutate(date = dmy(date)) %>% 
  collect()
#> # A tibble: 1 × 1
#>   date  
#>   <date>
#> 1 NA

arrow_table(date = "12JAN2004") %>% 
  collect() %>% 
  mutate(date = dmy(date))
#> # A tibble: 1 × 1
#>   date      
#>   <date>    
#> 1 2004-01-12

Created on 2024-01-23 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Australia.utf8
#>  ctype    English_Australia.utf8
#>  tz       Australia/Adelaide
#>  date     2024-01-23
#>  pandoc   3.1.1 @ C:/Users/gamb0043/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.2)
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.3.1)
#>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.3.1)
#>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.3.1)
#>  cli           3.6.2    2023-12-11 [1] CRAN (R 4.3.2)
#>  digest        0.6.34   2024-01-11 [1] CRAN (R 4.3.2)
#>  dplyr       * 1.1.2    2023-04-20 [1] CRAN (R 4.3.1)
#>  evaluate      0.23     2023-11-01 [1] CRAN (R 4.3.2)
#>  fansi         1.0.6    2023-12-08 [1] CRAN (R 4.3.2)
#>  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.3.1)
#>  fs            1.6.3    2023-07-20 [1] CRAN (R 4.3.1)
#>  generics      0.1.3    2022-07-05 [1] CRAN (R 4.0.2)
#>  glue          1.7.0    2024-01-09 [1] CRAN (R 4.3.2)
#>  htmltools     0.5.7    2023-11-03 [1] CRAN (R 4.3.2)
#>  knitr         1.45     2023-10-30 [1] CRAN (R 4.3.2)
#>  lifecycle     1.0.4    2023-11-07 [1] CRAN (R 4.3.2)
#>  lubridate   * 1.9.2    2023-02-10 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.0.5)
#>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.0.2)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.3.1)
#>  purrr         1.0.2    2023-08-10 [1] CRAN (R 4.3.1)
#>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.3.2)
#>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.3.1)
#>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.3.1)
#>  R.utils       2.12.2   2022-11-11 [1] CRAN (R 4.3.1)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.1)
#>  reprex        2.0.2    2022-08-17 [1] CRAN (R 4.3.1)
#>  rlang         1.1.3    2024-01-10 [1] CRAN (R 4.3.2)
#>  rmarkdown     2.25     2023-09-18 [1] CRAN (R 4.3.1)
#>  rstudioapi    0.15.0   2023-07-07 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.1)
#>  styler        1.10.2   2023-08-29 [1] CRAN (R 4.3.2)
#>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.3.1)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.0.2)
#>  timechange    0.2.0    2023-01-11 [1] CRAN (R 4.3.1)
#>  tzdb          0.4.0    2023-05-12 [1] CRAN (R 4.3.1)
#>  utf8          1.2.4    2023-10-22 [1] CRAN (R 4.3.2)
#>  vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.3.2)
#>  withr         2.5.2    2023-10-30 [1] CRAN (R 4.3.2)
#>  xfun          0.41     2023-11-01 [1] CRAN (R 4.3.2)
#>  yaml          2.3.8    2023-12-11 [1] CRAN (R 4.3.2)
#>
#>  [1] C:/Users/gamb0043/R
#>  [2] C:/Program Files/R/R-4.3.1/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

Component(s)

R

assignUser commented 9 months ago

Hey, thanks for the report.

I can't reproduce this with the same R and arrow versions, though I am on Linux, so it might be an issue with tzdata on Windows. Do other conversions work correctly?

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:arrow':
#> 
#>     duration
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

arrow_table(date = "12JAN2004") %>%
  mutate(date = dmy(date)) %>% 
  collect()
#> # A tibble: 1 × 1
#>   date      
#>   <date>    
#> 1 2004-01-12

arrow_table(date = "12JAN2004") %>% 
  collect() %>% 
  mutate(date = dmy(date))
#> # A tibble: 1 × 1
#>   date      
#>   <date>    
#> 1 2004-01-12

hamgamb commented 9 months ago

From others I've spoken to, this isn't reproducible on Mac either. When you say other conversions, do you mean the same string format with different methods (as below)?

library(arrow)
#> Warning: package 'arrow' was built under R version 4.3.2
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:arrow':
#> 
#>     duration
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

arrow_table(date = "12JAN2004") |> 
  mutate(date = dmy(date)) |> 
  collect()
#> # A tibble: 1 × 1
#>   date  
#>   <date>
#> 1 NA

arrow_table(date = "12JAN2004") |> 
  mutate(date = as.Date(date, format = "%d%B%Y")) |> 
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Failed to parse string: '12JAN2004' as a scalar of type timestamp[s]
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::collect(...)
#>  2. └─arrow:::collect.arrow_dplyr_query(...)
#>  3.   └─arrow:::compute.arrow_dplyr_query(x)
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
#>  9.               └─rlang::abort(msg, call = call)

arrow_table(date = "12JAN2004") |> 
  mutate(date = as_date(date, format = "%d%B%Y")) |> 
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Failed to parse string: '12JAN2004' as a scalar of type timestamp[s]
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::collect(...)
#>  2. └─arrow:::collect.arrow_dplyr_query(...)
#>  3.   └─arrow:::compute.arrow_dplyr_query(x)
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
#>  9.               └─rlang::abort(msg, call = call)

Created on 2024-01-23 with reprex v2.0.2

thisisnic commented 9 months ago

What's happening here is that "12JAN2004" is in the format which lubridate refers to as dBY or dbY (see ?lubridate::parse_date_time for the full spec). The dmy() binding in the arrow package is a wrapper around the parse_date_time() binding.

Our docs for parse_date_time() note: "parse_date_time(): quiet = FALSE is not supported. Available formats are H, I, j, M, S, U, w, W, y, Y, R, T. On Linux and OS X additionally a, A, b, B, Om, p, r are available."

Therefore on Windows, we wouldn't expect that code to work as b and B are not supported. I assume this is due to tzdata on Windows (which will be doing the parsing in the background) as noted by @assignUser.
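
To make the platform restriction concrete, here is a minimal base R sketch that reports which specifiers in a lubridate-style `orders` string fall into the Linux/macOS-only group quoted above. The helper name `windows_unsupported()` is hypothetical and not part of arrow or lubridate; the specifier list is copied from the docs excerpt, and the tokenising rule (an optional `O` prefix followed by one letter) is an assumption made for illustration.

```r
# Specifiers the arrow docs (quoted above) list as available only on
# Linux and OS X for the parse_date_time() binding.
unix_only <- c("a", "A", "b", "B", "Om", "p", "r")

# Hypothetical helper (not part of arrow): split an `orders` string such as
# "dbY" into specifiers and return those that are in the Linux/macOS-only group.
windows_unsupported <- function(orders) {
  # Assumption: a specifier is one letter, optionally prefixed by "O" (e.g. "Om").
  tokens <- regmatches(orders, gregexpr("O?[A-Za-z]", orders))[[1]]
  base::intersect(tokens, unix_only)
}

windows_unsupported("dbY")  # the order that "12JAN2004" needs
windows_unsupported("dmY")  # a purely numeric day-month-year order
```

Under these assumptions the first call flags `b`, which matches the explanation above for why `dmy("12JAN2004")` yields `NA` inside an arrow query on Windows, while the second call flags nothing.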

So this isn't a bug; it is expected behaviour. We should update our docs to make this information easier to find, though, because it is hard to discover without knowing that dmy() calls parse_date_time().

Thanks for reporting this @hamgamb!
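
Besides calling collect() before mutate(), as in the original report, one possible workaround is to rewrite the month abbreviation as a number in plain R before the data reaches arrow, so that no `b`/`B` specifier is needed. The sketch below uses base R only; `normalise_dmy_abbrev()` is a hypothetical helper, and it assumes fixed-width strings of the form ddMONyyyy with English month abbreviations.

```r
# Hypothetical helper (not part of arrow or lubridate): turn "12JAN2004"
# into the ISO 8601 string "2004-01-12".
# Assumes exactly 2 day digits, a 3-letter English month, and 4 year digits.
normalise_dmy_abbrev <- function(x) {
  day   <- substr(x, 1, 2)
  # month.abb is the base R constant c("Jan", "Feb", ..., "Dec").
  month <- match(toupper(substr(x, 3, 5)), toupper(month.abb))
  year  <- substr(x, 6, 9)
  out   <- sprintf("%s-%02d-%s", year, month, day)
  # Unrecognised month abbreviations become NA rather than a malformed string.
  out[is.na(month)] <- NA_character_
  out
}

iso <- normalise_dmy_abbrev("12JAN2004")
iso
as.Date(iso)
```

The normalised strings contain only digits and hyphens, so parsing them does not depend on month-name support. Whether this is preferable to collecting first depends on how large the data is and where the strings originate.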