Closed: jdhoffa closed this 7 months ago
@jdhoffa can you point to the failing test?
It's no longer a failing test in this case, but rather a failing check on CI.
See, for example, here: https://github.com/RMI-PACTA/r2dii.match/actions/runs/7505615529/job/20435241421
All Windows-related CI is failing. Can't reproduce locally of course, as it's Windows only.
\xfc is the ü character. I would argue that the Windows check is interpreting the string you're testing (oüdd) correctly, and it's all the other tests that are giving false positives.
One option would be to iconv() everything into UTF-8 early in the process, but do note that there are issues with iconv when running on macOS Sonoma. (See: https://github.com/RMI-PACTA/pacta.portfolio.import/pull/57)
x <- "o_\xfc_dd"
y <- iconv(x, from = "latin1", "utf-8")
print(y)
#> [1] "o_ü_dd"
iconv(x, from = "latin1", "ascii")
#> [1] NA
z <- iconv(y, "utf-8", "ascii")
print(z)
#> [1] NA
Created on 2024-01-16 with reprex v2.0.2
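To make the "convert early" idea concrete, here is a rough sketch in base R. The helper name to_utf8() and the assumption that any invalid bytes are latin1 are mine for illustration, not anything that exists in r2dii.match. It only touches elements that are not already valid UTF-8, so strings that are fine are left alone.

```r
# Hypothetical helper (not part of r2dii.match): re-encode only the elements
# that are not valid UTF-8, assuming those bytes are latin1.
to_utf8 <- function(x, assumed = "latin1") {
  bad <- !is.na(x) & !validUTF8(x)
  x[bad] <- iconv(x[bad], from = assumed, to = "UTF-8")
  x
}

mixed <- c(
  rawToChar(as.raw(c(0x6f, 0xfc, 0x64, 0x64))),  # "oüdd" as unmarked latin1 bytes
  "o\u00fcdd",                                   # "oüdd" already in UTF-8
  "odd",                                         # plain ASCII
  NA
)

out <- to_utf8(mixed)

# After conversion, the first two elements hold the same UTF-8 bytes.
identical(charToRaw(out[1]), charToRaw(out[2]))
```

Guessing the source encoding is the weak point here: if the invalid bytes are not actually latin1, the conversion will silently produce the wrong characters.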
The latin1 (also windows-1252) byte for ü is \xfc, but the UTF-8 bytes for ü are \xc3\xbc. In R, "\xfc" is a string literal, and "\x..." is a hex escape that the R parser converts to a byte. By default, a string literal will be assigned an encoding of "unknown", so how it is printed on the console depends on your R locale. Since most Linux and macOS R sessions run with a UTF-8 console, you'll probably at least need to specify the encoding as latin1 on the string literal in that test so that it is interpreted correctly, e.g.:
"o_\xfc_dd"
#> [1] "o_\xfc_dd"
Encoding("o_\xfc_dd")
#> [1] "unknown"
`Encoding<-`("o_\xfc_dd", "latin1")
#> [1] "o_ü_dd"
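If it helps to see the bytes directly, the following base R sketch compares the two representations of ü. The variable names are just for illustration. It builds the latin1 byte with rawToChar() so the example does not depend on how the source file or console is encoded, and uses a "\u" escape for the UTF-8 version, which the parser marks as UTF-8.

```r
latin1_u <- rawToChar(as.raw(0xfc))  # a single latin1 byte, not marked with an encoding
utf8_u   <- "\u00fc"                 # \u escapes are marked as UTF-8 by the parser

charToRaw(latin1_u)
#> [1] fc
charToRaw(utf8_u)
#> [1] c3 bc

# Once the latin1 string is marked, R can convert it to the same UTF-8 bytes.
Encoding(latin1_u) <- "latin1"
converted <- enc2utf8(latin1_u)
identical(charToRaw(converted), charToRaw(utf8_u))
#> [1] TRUE
```

Writing the test input as "o\u00fcdd" rather than "o\xfcdd" might be another way to make the test independent of the platform's locale, since the string then carries its encoding with it.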
Thanks for the info guys!! Much appreciated as I am not an encoding connoisseur
Hmm, this seems to be a question of what the expected/desired behaviour of to_alias is (which honestly I'm not even sure about).
In #425, @maurolepore was expecting that the alias of "oüdd" is "odd". I would expect that the alias of "oüdd" is "oudd". I will dig in to see if there is a reason that we remove the accented vowel entirely, or if translating it into an unaccented vowel is the way to go.
In the meantime, @maurolepore, if you have any context here, particularly in relation to #425, that would be appreciated!
Whether intended or not, I believe the current behavior is that "ü" (if encoded in a non-UTF-8 encoding but not marked as such) is un-transliterable and is therefore removed from the string. I kinda doubt that was intentional.
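A small sketch of the mechanism I have in mind, using plain iconv() rather than the actual to_alias code (I have not checked which conversion to_alias really uses, so treat the call below as an assumption). When the input is treated as UTF-8, a lone \xfc byte is not a valid sequence, and sub = "" tells iconv to replace anything it cannot convert with nothing.

```r
# "oüdd" as latin1 bytes, left unmarked (encoding "unknown").
x <- rawToChar(as.raw(c(0x6f, 0xfc, 0x64, 0x64)))

# Read as UTF-8, the 0xfc byte cannot be converted; `sub = ""` substitutes an
# empty string for non-convertible bytes instead of returning NA.
dropped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
print(dropped)
```

How non-convertible bytes are handled can differ between iconv implementations, so the printed result may not be the same on every platform.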
Makes sense! I doubt it was intentional as well. Just curious how it made its way into the test. I'll check the git blame.
Originally posted by @jacobvjk in https://github.com/RMI-PACTA/r2dii.match/pull/435#pullrequestreview-1821191363