Persistant Windows encoding issue causes CI to fail

jdhoffa commented 8 months ago

          @jdhoffa any idea why a test is failing for some windows checks? the errors seem to point to an issue that we thought was closed: https://github.com/RMI-PACTA/r2dii.match/issues/425

Originally posted by @jacobvjk in https://github.com/RMI-PACTA/r2dii.match/pull/435#pullrequestreview-1821191363

AlexAxthelm commented 8 months ago

@jdhoffa can you point to the failing test?

jdhoffa commented 8 months ago

It's no longer a failing test in this case, but rather a failing check on CI.

See, for example, here: https://github.com/RMI-PACTA/r2dii.match/actions/runs/7505615529/job/20435241421

All Windows-related CI is failing. Can't reproduce locally of course, as it's Windows only.

AlexAxthelm commented 8 months ago

\xfc is the ü character. I would argue that the windows check is interpreting the string you're testing (oüdd) correctly, and it's all the other tests that are giving false positives.

One option would be to iconv() everything into UTF-8 early in the process, but do note that there are issues with iconv when running on MacOS Sonoma. (See: https://github.com/RMI-PACTA/pacta.portfolio.import/pull/57)

x <- "o_\xfc_dd"
y <- iconv(x, from = "latin1", "utf-8")
print(y)
#> [1] "o_ü_dd"
iconv(x, from = "latin1", "ascii")
#> [1] NA
z <- iconv(y, "utf-8", "ascii")
print(z)
#> [1] NA

^{Created on 2024-01-16 with reprex v2.0.2}

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.2 (2023-10-31) #> os macOS Sonoma 14.2 #> system aarch64, darwin23.0.0 #> ui unknown #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Belgrade #> date 2024-01-16 #> pandoc 3.1.7 @ /opt/homebrew/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1) #> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1) #> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.1) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.1) #> htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.1) #> knitr 1.43 2023-05-25 [1] CRAN (R 4.3.1) #> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1) #> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.1) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.1) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.1) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1) #> rlang 1.1.2 2023-11-04 [1] CRAN (R 4.3.1) #> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1) #> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1) #> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1) #> withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.1) #> xfun 0.40 2023-08-09 [1] CRAN (R 4.3.1) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.1) #> #> [1] /opt/homebrew/lib/R/4.3/site-library #> [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

cjyetman commented 8 months ago

The latin1 (also windows-1252) byte for ü is \xfc, but the utf-8 bytes for ü is \xc3\xbc. In R, "\xfc" is a "string literal", and "\x..." is a hex escape that the R parser auto-converts to a byte. By default, a string literal will be assigned an encoding of "unknown", so how it is printed on the console will be dependent on your R locale. Since most linux and macOS R sessions run with a utf-8-ish console, you'll probably at least need to specify the encoding as latin1 on the string literal in that test so that it is interpreted correctly. e.g.

"o_\xfc_dd"
#> [1] "o_\xfc_dd"
Encoding("o_\xfc_dd")
#> [1] "unknown"
`Encoding<-`("o_\xfc_dd", "latin1")
#> [1] "o_ü_dd"

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.2 (2023-10-31) #> os macOS Sonoma 14.2.1 #> system aarch64, darwin20 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Berlin #> date 2024-01-16 #> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1) #> digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.1) #> evaluate 0.23 2023-11-01 [1] RSPM #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0) #> glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.1) #> htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) #> knitr 1.45 2023-10-30 [1] RSPM (R 4.3.0) #> lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0) #> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.0) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.0) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0) #> R.oo 1.25.0 2022-06-12 [1] RSPM #> R.utils 2.12.3 2023-11-18 [1] CRAN (R 4.3.1) #> reprex 2.1.0 2024-01-11 [1] CRAN (R 4.3.1) #> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.1) #> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1) #> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0) #> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.0) #> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1) #> withr 2.5.2 2023-10-30 [1] RSPM (R 4.3.0) #> xfun 0.41 2023-11-01 [1] RSPM #> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.1) #> #> [1] /Users/cjrmi/Library/R/arm64/4.3/library #> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

jdhoffa commented 8 months ago

Thanks for the info guys!! Much appreciated as I am not an encoding connoisseur

jdhoffa commented 7 months ago

Hmm this seems to be a question of what the expected/ desired behaviour is of to_alias (which honestly I'm not even sure).

In #425, @maurolepore was expecting that the alias of "oüdd" is "odd". I would expect that the alias of "oüdd" is "oudd". I will dig in to see if there is a reason that we remove the accented vowel entirely, or if translating it into an unaccented vowel is the way to go.

In the mean time, @maurolepore if you have any context here, and in relation to #425 that would be appreciated!

cjyetman commented 7 months ago

Hmm this seems to be a question of what the expected/ desired behaviour is of to_alias (which honestly I'm not even sure).

In #425, @maurolepore was expecting that the alias of "oüdd" is "odd". I would expect that the alias of "oüdd" is "oudd". I will dig in to see if there is a reason that we remove the accented vowel entirely, or if translating it into an unaccented vowel is the way to go.

Whether intended or not, I believe the current behavior is that "ü" (if encoded in a non-UTF-8 encoding but not marked as such) is un-transliterable and is therefore removed from the string. I kinda doubt that was intentional.

jdhoffa commented 7 months ago

Makes sense! I doubt it was intentional as well. Just curious how it made it's way into the test. I'll check the git blame

RMI-PACTA / r2dii.match

Persistant Windows encoding issue causes CI to fail #436