RacimoLab / demes-r

R library for parsing Demes demographic models

utf encoding issue with output from the reference parser #11

Open grahamgower opened 1 year ago

grahamgower commented 1 year ago

The reference parser output for the valid test case unicode_deme_name_04.yaml from the demes-spec repo is encoded as utf16, when it should be encoded as utf8.

grahamgower commented 1 year ago

See #9.

grahamgower commented 1 year ago

@IsabelMarleen I think we must have had a similar problem when using the reference parser in the demes-python test suite (but only in the continuous integration when running on Windows). It turned out to be an issue with Python choosing the encoding for stdout to match the OS-configured locale, which on Windows was utf16 by default. The solution was to call python with the -X utf8 option to override the default encoding. https://github.com/popsim-consortium/demes-python/blob/392c6a0eb5e70223a00d6659df2134317a94bdf0/tests/test_spec.py#L33-L34

I guess you're using a locale on your computer for which the default encoding is utf16? Could you try adding the -X utf8 option when calling the reference parser here: https://github.com/RacimoLab/demes-r/blob/f484f43d9d4e194dc11f1e7938d74aa3fd8dcf22/tests/testthat/helper-functions.R#L43
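
For illustration, here is a minimal sketch of what calling a Python script with -X utf8 can look like from a test harness. It is not the code from either linked file: the helper name run_with_utf8 and the inline child script are hypothetical, and a one-line child program stands in for the reference parser. Note that -X is an interpreter option, so it has to come before the script path (or before -c); anything placed after the script path is passed to the script as an ordinary argument.

```python
import subprocess
import sys


def run_with_utf8(args):
    """Run a Python child process in UTF-8 mode and return its stdout as text.

    Hypothetical helper for illustration only. "-X utf8" is placed directly
    after the interpreter, before any script path or "-c" argument.
    """
    result = subprocess.run(
        [sys.executable, "-X", "utf8", *args],
        capture_output=True,
        check=True,
    )
    # Decode explicitly as UTF-8 instead of relying on the parent's locale.
    return result.stdout.decode("utf-8")


# Stand-in for the reference parser: print one non-BMP character (U+29E3D).
child_code = "print('\\U00029e3d')"
output = run_with_utf8(["-c", child_code])
```

With a real script the call would take the shape run_with_utf8([script_path, yaml_path]), where both paths are placeholders.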

grahamgower commented 1 year ago

Some discussions about encoding here: https://github.com/popsim-consortium/demes-spec/issues/129

IsabelMarleen commented 1 year ago

I tried it just now and it did not make a difference. I ran python3 reference_implementation/resolve_yaml.py test-cases/valid/unicode_deme_name_04.yaml -X utf8 > tmp.json and in the output the property in question looks like "name": "\ud867\ude3d". When trying to read tmp.json with yaml::read_yaml() I get the following error:

Error in yaml.load(string, error.label = error.label, ...) :
  (tmp.json) Scanner error: while parsing a quoted scalar at line 9, column 15
  found invalid Unicode character escape code at line 9, column 18

Without specifying -X utf8, the yaml parser worked when I specified fileEncoding=UTF-16, but that now throws a different error. The scanner error, however, is the same one I encountered before.
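
As a side note on the "\ud867\ude3d" value: in JSON, a pair of \u escapes like this is a UTF-16 surrogate pair that stands for a single character outside the Basic Multilingual Plane, so the file itself can be plain ASCII regardless of the stdout encoding. The sketch below uses only Python's standard json module to show the round trip. Whether the reference parser produces its output this way is an assumption, not something confirmed in this thread.

```python
import json

# The escaped form reported above, as it would appear inside a JSON file.
escaped = '"\\ud867\\ude3d"'

# A JSON parser combines the surrogate pair into one code point.
name = json.loads(escaped)
print(len(name), hex(ord(name)))

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which reproduces the surrogate-pair form.
ascii_json = json.dumps({"name": name})

# With ensure_ascii=False the character is written literally, and the
# encoding is then chosen when the text is written out as bytes.
utf8_json = json.dumps({"name": name}, ensure_ascii=False)
utf8_bytes = utf8_json.encode("utf-8")
```

If this is what is happening, the escape is valid JSON, and the scanner error would come from the YAML reader not accepting surrogate-pair escapes rather than from the file's byte encoding.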

grahamgower commented 1 year ago

What operating system are you using?

IsabelMarleen commented 1 year ago

I'm using macOS.