RMI-PACTA / pacta.portfolio.import

Functions that validate input “listed security” portfolios.
https://rmi-pacta.github.io/pacta.portfolio.import/
Other
2 stars 0 forks source link

encoding test crashes R session on macOS #54

Closed cjyetman closed 9 months ago

cjyetman commented 11 months ago

This test https://github.com/RMI-PACTA/pacta.portfolio.import/blob/13beda3b52afc6e53ff644b1d454001c596dd79b/tests/testthat/test-read_portfolio_csv.R#L121-L138

causes a hard error and crash of the R session with

Assertion failed: (tmpin > *in), function _citrus_iconv_std_iconv_convert, file citrus_iconv_std.c, line 1059.

Exited with status 134.

I tracked down the root of this to this line https://github.com/RMI-PACTA/pacta.portfolio.import/blob/13beda3b52afc6e53ff644b1d454001c596dd79b/R/has_newline_at_end.R#L26

which intends to convert the text into UTF-8, and if any character is unconvertible replace it with an empty string

That line was added in https://github.com/RMI-PACTA/pacta.portfolio.import/pull/49 to avoid an error with grepl not receiving valid UTF-8 text

This appears to affect my machine (macOS 14.1.1 / arm64), but not a RStudio Cloud instance (some linux flavor), or our GitHub Action runners. There seems to be a bug in R base's iconv() or the underlying system libraries that it may or may not use, that causes a crash of the R session if in the string being processed a quotation mark is followed by an extended character...

txt <- 'Über text'
txt

win_txt <- iconv(txt, to = "ISO-8859-1")
win_txt

iconv(win_txt, to = "UTF-8")

iconv(win_txt, to = "UTF-8", sub = "")

txt <- '"Über" text'
txt

win_txt <- iconv(txt, to = "ISO-8859-1")
win_txt

iconv(win_txt, to = "UTF-8")

# THIS WILL CRASH THE R SESSION
# iconv(win_txt, to = "UTF-8", sub = "")
AlexAxthelm commented 11 months ago

Cannot Replicate.

txt <- '"Über" text'
txt
#> [1] "\"Über\" text"

win_txt <- iconv(txt, to = "ISO-8859-1")
win_txt
#> [1] "\"\xdcber\" text"

iconv(win_txt, to = "UTF-8")
#> [1] NA

# This does not crash my system.
iconv(win_txt, to = "UTF-8", sub = "")
#> [1] "\"ber\" text"

Created on 2023-11-28 with reprex v2.0.2

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.1 (2023-06-16) #> os macOS Ventura 13.5.2 #> system aarch64, darwin22.4.0 #> ui unknown #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Belgrade #> date 2023-11-28 #> pandoc 3.1.7 @ /opt/homebrew/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.1) #> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1) #> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.1) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.1) #> htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.1) #> knitr 1.43 2023-05-25 [1] CRAN (R 4.3.1) #> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.1) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1) #> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.1) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.1) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.1) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1) #> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.1) #> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1) #> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1) #> vctrs 0.6.4 2023-10-12 [1] CRAN (R 4.3.1) #> withr 2.5.1 2023-09-26 [1] CRAN (R 4.3.1) #> xfun 0.40 2023-08-09 [1] CRAN (R 4.3.1) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.1) #> #> [1] /opt/homebrew/lib/R/4.3/site-library #> [2] /opt/homebrew/Cellar/r/4.3.1/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
cjyetman commented 11 months ago

weird... posting my session info for comparison

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       macOS Sonoma 14.1.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-11-29
#>  pandoc   3.1.8 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
#>  evaluate      0.23    2023-11-01 [1] RSPM
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.7   2023-11-03 [1] RSPM (R 4.3.0)
#>  knitr         1.45    2023-10-30 [1] RSPM (R 4.3.0)
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM (R 4.3.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [2] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2   2022-06-13 [2] CRAN (R 4.3.0)
#>  R.oo          1.25.0  2022-06-12 [2] CRAN (R 4.3.0)
#>  R.utils       2.12.3  2023-11-18 [2] RSPM (R 4.3.0)
#>  reprex        2.0.2   2022-08-17 [2] CRAN (R 4.3.0)
#>  rlang         1.1.2   2023-11-04 [1] RSPM (R 4.3.0)
#>  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  styler        1.10.2  2023-08-29 [2] RSPM (R 4.3.1)
#>  vctrs         0.6.4   2023-10-12 [1] RSPM (R 4.3.0)
#>  withr         2.5.2   2023-10-30 [1] RSPM (R 4.3.0)
#>  xfun          0.41    2023-11-01 [1] RSPM
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] /Users/cjrmi/Library/R/arm64/4.3/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
cjyetman commented 11 months ago

seems like the most significant, relevant difference is..

#>  os       macOS Ventura 13.5.2
#>  system   aarch64, darwin22.4.0

versus

#>  os       macOS Sonoma 14.1.1
#>  system   aarch64, darwin20

curious what will happen once you update

AlexAxthelm commented 11 months ago

Hmm. Yeah. this looks like it might be a Sonoma issue. https://groups.google.com/g/r-sig-mac/c/pFuJu7F33Eg/m/xdrBmg-PDwAJ

cjyetman commented 11 months ago

We could (temporarily?) put a testthat::skip_on_os(os = "mac") in that specific test so I can test locally until this is resolved?

AlexAxthelm commented 11 months ago

I would suggest making it perhaps a bit more specific than that, and use something like skip_if(Sys.info()$sysname == "Darwin" && Sys.info()$release == "23.2.0")

cjyetman commented 11 months ago

I'm tempted now not to skip the test, because the function has_newline_at_end() will still hard crash R on >= macOS 14, which is the real problem here.

AlexAxthelm commented 11 months ago

I've upgraded to Sonoma, but it looks like R 4.3.2 clears this issue up:

txt <- '"Über" text'
txt
#> [1] "\"Über\" text"

win_txt <- iconv(txt, to = "ISO-8859-1")
win_txt
#> [1] "\"\xdcber\" text"

iconv(win_txt, to = "UTF-8")
#> [1] NA

# This does not crash my system.
iconv(win_txt, to = "UTF-8", sub = "")
#> [1] "\"ber\" text"

Created on 2023-12-20 with reprex v2.0.2

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.2 (2023-10-31) #> os macOS Sonoma 14.2 #> system aarch64, darwin23.0.0 #> ui unknown #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Belgrade #> date 2023-12-20 #> pandoc 3.1.7 @ /opt/homebrew/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1) #> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1) #> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.1) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.1) #> htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.1) #> knitr 1.43 2023-05-25 [1] CRAN (R 4.3.1) #> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1) #> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.1) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.1) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.1) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1) #> rlang 1.1.2 2023-11-04 [1] CRAN (R 4.3.1) #> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1) #> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1) #> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1) #> withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.1) #> xfun 0.40 2023-08-09 [1] CRAN (R 4.3.1) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.1) #> #> [1] /opt/homebrew/lib/R/4.3/site-library #> [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
cjyetman commented 11 months ago

Likewise for me. That's good news, but... raising the bar to >= R v4.3.2 might be a bit extreme.

AlexAxthelm commented 11 months ago

Yeah. Probably a bit extreme. Maybe we put a check in .onLoad and emit a warning if R < 4.3.2 on Sonoma or higher?

cjyetman commented 11 months ago

I think ideally I should find a more robust way of testing for a newline at the end of the file... maybe using system's wc -l or something?

AlexAxthelm commented 11 months ago

Hmm. I put #57 in to resolve the particular issue, but if we're only using it to check for newline chars, then yeah, I'm thinking that this might be me building a better mousetrap

AlexAxthelm commented 11 months ago

The entire point of this function is to check if there's a trailing newline at the end of the CSV?

cjyetman commented 11 months ago

The entire point of this function is to check if there's a trailing newline at the end of the CSV?

Yes, because there was a time when read.csv(), readr::read_csv(), and/or vroom::vroom() could not read in a CSV file that did not end with a newline without a crash or failure.