brodieG / unitizer

Easy R Unit Tests

Deparse Depends on Locales For Strings #289

Closed brodieG closed 2 years ago

brodieG commented 2 years ago

For example, in an ISO-8859 locale we see:

nchar_ctl("\033\200")

vs.

nchar_ctl("\033\x80")

Probably because "\x80" is a non-character in 8859 vs. latin-1.
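
As a rough illustration of the difference above (a sketch, not taken from the unitizer code base): the two literals denote the same bytes, so only the deparsed text differs. The variable names are made up for the example, and what `deparse()` prints depends on the session locale, so no output is shown.

```r
# Both literals denote the same two bytes: ESC (0x1b) followed by 0x80.
octal <- "\033\200"
hex   <- "\033\x80"

# The in-memory strings are byte-for-byte identical.
same_bytes <- identical(charToRaw(octal), charToRaw(hex))
print(same_bytes)

# How deparse() renders the non-ASCII byte depends on the session locale,
# so the text stored for a test can differ between machines.
print(Sys.getlocale("LC_CTYPE"))
print(deparse(octal))
```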

brodieG commented 2 years ago

There is no good solution to this. We considered adding a parse/deparse cycle on load to guarantee a consistent deparse, but the original deparse can be invalid when the cycle runs in a different locale: if the raw bytes are valid in the reference locale, the deparser outputs them as-is, but the parser will refuse to ingest them if they are invalid in the new locale.
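
A minimal sketch of what such a parse/deparse cycle might look like, assuming the stored deparse is available as a character string. `redeparse` is a hypothetical helper, not a unitizer function; it returns `NA` when the parser rejects the text, which is the failure mode described above.

```r
# Hypothetical helper: re-deparse previously stored call text in the current
# locale. Returns NA_character_ if the parser cannot ingest the text, e.g.
# because it contains bytes that are invalid in this locale.
redeparse <- function(txt) {
  tryCatch(
    {
      expr <- parse(text = txt, keep.source = FALSE)[[1L]]
      paste(deparse(expr), collapse = "\n")
    },
    error = function(e) NA_character_
  )
}

# ASCII-only text round-trips in any locale (whitespace is normalized).
print(redeparse("f(x,  1)"))

# Text the parser rejects comes back as NA instead of raising an error.
print(redeparse("f("))
```

Even when the cycle succeeds, the result reflects the current locale, so this does not by itself recover the reference deparse.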

Potentially we could use the new version of the RDS format, which records the locale, so that those literals are translated from their original locale to the new one. Whether that translation matches the intent is highly questionable, particularly since there is no way to directly mark a literal with its encoding (Unicode escapes come close).
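
For reference, a small sketch of inspecting what a version 3 RDS file records, assuming R >= 3.5.0 where `infoRDS()` is available. The temporary file and the stored string are illustrative only; the reported encoding depends on the session that wrote the file.

```r
# Write a string containing a non-ASCII byte using RDS format version 3.
tmp <- tempfile(fileext = ".rds")
saveRDS("\033\200", tmp, version = 3)

# infoRDS() reports the format version and the writer's native encoding.
info <- infoRDS(tmp)
print(info$version)
print(info$native_encoding)

# A Unicode escape is the closest thing to marking a literal's encoding.
print(Encoding("\u00e9"))

unlink(tmp)
```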

So probably we just need to make sure this is clearly documented.

brodieG commented 2 years ago

Documented as a "fix"