att / rcloud

Collaborative data analysis and visualization
http://rcloud.social
MIT License

strings with undeclared encoding that are not UTF8 bomb rserve.js #1653

Open gordonwoodhull opened 9 years ago

gordonwoodhull commented 9 years ago

Strings containing bytes in the range 0x81-0xFF that do not form valid UTF-8 pass through Rserve unchanged (even when rserve.conf specifies encoding utf8) and then break rserve.js, which reports the confusing "URI malformed" error because it decodes strings with decodeURIComponent.

Example:

password("P\xe1ssword")
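The "URI malformed" message can be reproduced without Rserve. The sketch below assumes the decoder percent-encodes each byte and hands the result to decodeURIComponent; the helper name decodeUtf8ViaUri is hypothetical and is not taken from rserve.js.

```javascript
// Hypothetical helper (not the rserve.js source): decode a byte sequence
// as UTF-8 by percent-encoding every byte and calling decodeURIComponent.
function decodeUtf8ViaUri(bytes) {
  let encoded = "";
  for (const b of bytes) {
    encoded += "%" + b.toString(16).padStart(2, "0");
  }
  // Throws URIError when the bytes are not well-formed UTF-8.
  return decodeURIComponent(encoded);
}

// "Pássword" as valid UTF-8: á is the two bytes 0xC3 0xA1.
const validBytes = [0x50, 0xc3, 0xa1, 0x73, 0x73, 0x77, 0x6f, 0x72, 0x64];
// "P\xe1ssword" as R passes it along: a lone 0xE1 byte, not valid UTF-8.
const invalidBytes = [0x50, 0xe1, 0x73, 0x73, 0x77, 0x6f, 0x72, 0x64];

console.log(decodeUtf8ViaUri(validBytes));

try {
  decodeUtf8ViaUri(invalidBytes);
} catch (err) {
  // The error is a URIError; V8 words the message as "URI malformed".
  console.log(err.name);
}
```

The error type says nothing about encodings, which is why the report reads as confusing on the client side.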

This comes up when reading the TIGER/Line data set, which has plenty of Spanish county names.

Originally reported here: https://github.com/att/rserve-js/issues/2

s-u commented 9 years ago

This has nothing to do with latin characters, but rather with strings that have an undeclared encoding. If you declare latin1, it works just fine:

> a="P\xe1ssword"
> Encoding(a)
[1] "unknown"
> Encoding(a)="latin1"
> a
[1] "Pássword"   

The real problem is that whatever code reads TIGER/Line doesn't declare the encoding properly.

That said, we need to do something so that we don't bomb on those strings. Since they are useless anyway (the content is really undefined), we may as well take them as bytes, cast each byte to a Uint16, treat the result as Unicode, and then encode it to UTF-8. The issue with that is that the JS side has no way to distinguish such undefined strings from valid UTF-8. The alternative is to pass them as byte arrays.
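The byte-casting idea and its ambiguity can be illustrated as follows. This is only a sketch of the mapping, written in JavaScript for illustration; in the proposal above the conversion would happen before the string reaches the JS side. The function name bytesToCodeUnits is hypothetical.

```javascript
// Hypothetical illustration: widen each byte to one 16-bit code unit,
// so byte 0xE1 becomes U+00E1. This never fails, whatever the bytes are.
function bytesToCodeUnits(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// The undeclared string from the report: a lone 0xE1 byte.
const undeclared = [0x50, 0xe1, 0x73, 0x73, 0x77, 0x6f, 0x72, 0x64];
console.log(bytesToCodeUnits(undeclared) === "P\u00e1ssword"); // true

// The same mapping applied to bytes that were valid UTF-8 for "á"
// yields two characters (U+00C3, U+00A1) instead of one.
const utf8ForAacute = [0xc3, 0xa1];
console.log(bytesToCodeUnits(utf8ForAacute) === "\u00c3\u00a1"); // true
```

Both results are ordinary strings once they arrive, so the receiver cannot tell which ones were byte-cast. Passing a byte array instead keeps that distinction visible in the type.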

s-u commented 9 years ago

I checked R's behavior and it's a bit more complicated than that. Normally, an undeclared encoding is taken to be the current locale. Since we're running in a UTF-8 locale, a="\xe1" is entirely illegal, since that is not a valid UTF-8 string. Unfortunately, R will let it pass, so a string in the native locale may not actually be valid. Strictly speaking, this is a user error, so bailing out at some point is the right answer. The problem is that the only way to detect such strings is to do a full UTF-8 validity check on every string we pass, which seems like a fairly heavy penalty for detecting edge cases.
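For reference, a full validity check amounts to one linear pass over every string's bytes. A minimal sketch in JavaScript, assuming a runtime that provides the standard TextDecoder (modern browsers and Node.js); the function name isValidUtf8 is hypothetical, and where such a check would live (server or client) is left open here.

```javascript
// Hypothetical helper: report whether a byte sequence is well-formed UTF-8.
// With { fatal: true }, TextDecoder throws a TypeError on invalid input
// instead of substituting U+FFFD.
const strictDecoder = new TextDecoder("utf-8", { fatal: true });

function isValidUtf8(bytes) {
  try {
    strictDecoder.decode(Uint8Array.from(bytes));
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidUtf8([0x50, 0xc3, 0xa1])); // valid: "Pá" in UTF-8
console.log(isValidUtf8([0x50, 0xe1, 0x73])); // invalid: lone 0xE1 lead byte
```

The cost is proportional to the total number of bytes transferred, which is the penalty the comment above is weighing against the rarity of invalid strings.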