eddelbuettel / rcpptoml

Rcpp Bindings to C++ parser for TOML files
GNU General Public License v2.0
36 stars 9 forks source link

Problems with encoding (with non-ASCII chars) #28

Closed vh-d closed 6 years ago

vh-d commented 6 years ago

I run into encoding issues when parsing files as well as R characters.

Parsing from text (R characters):

> test_input <- enc2utf8("value='Žluťoučký kůň'")
> Encoding(test_input)
[1] "UTF-8"
> test_output <- RcppTOML::parseTOML(test_input, fromFile = FALSE)$value
> cat(test_output)
Žluťoučký kůň
> Encoding(test_output)
[1] "unknown"

The encoding attribute is lost. But may be set again

> Encoding(test_output) <- "UTF-8"
> cat(test_output)
Žluťoučký kůň

Parsing from files

Example: test.txt

> test_file <- RcppTOML::parseTOML("test.txt
")
> test_file$value
[1] "Ĺ˝luĹĄouÄŤkĂ˝ kĹŻĹ\u0088"
> Encoding(test_file$value)
[1] "unknown"
> Encoding(test_file$value) <- "UTF-8"
> test_file$value
[1] "Žluťoučký kůň"

TOML files are assumed to be UTF-8 Unicode texts. However R characters obtained from parsing via parseTOML() are labeled as "unknown" encoding.

In case of files, the solution may be relatively easy, I think. We can assume that input is UTF-8 and label every string output as "UTF-8".

eddelbuettel commented 6 years ago

Just to double check: what platform are you on?

vh-d commented 6 years ago

Good point, this was Windows 2008. I will check Linux in few hours and report back.

eddelbuettel commented 6 years ago

In a way that's good -- Windows seems to be the worst. So if Encoding() covers it there we should be good on the others (who may already be good).

vh-d commented 6 years ago

On Linux:

Parsing character value:

> test_input <- enc2utf8("value='Žluťoučký kůň'")
> Encoding(test_input)
[1] "UTF-8"
> test_output <- RcppTOML::parseTOML(test_input, fromFile = FALSE)$value
> cat(test_output)
Žluťoučký kůň
> Encoding(test_output)
[1] "unknown"
> Encoding(test_output) <- "UTF-8"
> cat(test_output)
Žluťoučký kůň

and parsing files:

> test_file <- RcppTOML::parseTOML("~/Downloads/test.txt")
> test_file$value
[1] "Žluťoučký kůň"
> Encoding(test_file$value)
[1] "unknown"
> Encoding(test_file$value) <- "UTF-8"
> test_file$value
[1] "Žluťoučký kůň"

So the problem persists (lost encoding attribute) but the consequences are negligible because "unknown" encoding on modern linux is "UTF-8" anyway.

vh-d commented 6 years ago

I suggest

The first would solve non-file input. The second would solve both file and non-file inputs given that user is assumed to provide UTF-8 files by TOML spec.

eddelbuettel commented 6 years ago

Sounds good. Also see #20 which is pretty much the same, no?

vh-d commented 6 years ago

Yeah, sorry about that. I will try to come up with a pull request soon as this bug bites my application quite a bit.

eddelbuettel commented 6 years ago

That's ok. You are being careful, and you are constructing good examples. One change at a time...

eddelbuettel commented 6 years ago

Ok, I just pushed a PR with a change I had following the 0.1.4 release. I'd like to make one more change in there and properly document your last change -- see ChangeLog which is a standard (older) format well supported by Emacs :) Can you drop me a full name please, either here or if you prefer by email? And if you want an email different from the one used by git log. Thanks! The 0.1.5 release some time next week will be much improved thanks to your help.

vh-d commented 6 years ago

Can you drop me a full name please, either here if you prefer by email? And if you want an email different from the one used by git log.

My name is Václav Hausenblas, nice to meet you :-) Big fan of tinyverse, btw.

Thanks! The 0.1.5 release some time next week will be much improved thanks to your help.

My pleasure! I am going to play with encoding now...

eddelbuettel commented 6 years ago

Ok, you're in the ChangeLog now :) And I got my other issue taken care of -- the (new in 0.5.0) local_time type comes back to us now too (as a string, there is no real type for it and I don't think I want to pull in hms just for this).

So if/when you something for encoding feel free to branch or fork again and show it :)

eddelbuettel commented 6 years ago

Fixed in #30