sjcl.codec.utf8String.fromBits() fails for certain valid inputs

hiddentao commented 10 years ago

I just generated 512 bytes using sjcl.random to get the following array:

[
1622257303,775647481,-496345452,-1065731224,938938801,-868508255,1182296851,2075930857,-792325703,1093833160,-508421808,1671982444,2141504986,-1300002288,1828344264,-532030464,740978242,-1497102173,232566574,1072969131,-2126784285,-535958573,1015035333,-832805901,1883314115,1150248694,-1050503930,1577776106,70432378,257529303,-1746559340,-950319918,1693838903,-1250282989,-1674319739,713903827,1140023625,-1582577601,1120387352,-1356257844,-1326845606,1007890998,-1046079322,699734250,1260199734,2080253321,1695392120,-1919347056,-1060714667,822261715,1612715415,456962062,1463351678,1702269005,-1375065178,-158540089,-1873872142,918002847,1873753978,458509330,-1886461037,1364972118,1800807947,1497633483,1499638961,-541041457,575078314,-646248125,-616139408,-1258371659,-1210283235,-426131835,-1642869198,1517217351,1332754895,-502811651,1503941665,652072098,817557306,-95231440,-469293222,-1836320109,-686552620,-456718825,-721873954,-692893785,160989394,819403313,-209030384,439571513,512861205,-932139664,112711185,1314557284,1589427245,-311985839,-1165386670,-470296286,812431969,-1973474859,1593137088,-862616718,-185862776,-1276603339,1173469361,-1074349449,-620807694,53385495,290318359,-812801716,-753845466,1411363659,-788250947,-491962853,-725438570,-524366490,-1658139756,711293111,563009880,-1986405713,1844478627,-1243027816,-2038286344,693629141,-951110529,-633242892,1408178737,-666937540
]

If I pass this to sjcl.codec.utf8String.fromBits I get the following error:

URIError: URI malformed
    at decodeURIComponent (native)
    at Object.sjcl.codec.utf8String.fromBits

The escape and unescape methods are deprecated in JS. Actually, why are we encoding/decoding for URI use in the first place? If needed why not split these out into separate methods?

Nilos commented 10 years ago

Guessed: to protect developers when transfering those utf8-data in e.g. a url-parameter. But for the correct answer you will have to wait for @bitwiseshiftleft.

stbuehler commented 10 years ago

Not many random bit strings are valid utf-8. The error message could be improved though...

hiddentao commented 10 years ago

In the end I decided to write a version of sjcl.codec.utf8String.fromBits which didn't call escape/encodeURIComponent.

stbuehler commented 10 years ago

such interpretation is a "fixed-8-bit encoding" of unicode which obviously can only represent unicode codepoints below 0x100. I see no value in such encoding.

If you need to encode random binary data as some string I'd recommend base64; your version works too ofc (just that there is no "name" for such encoding afaik).

It'd be nice if you could either clarify what you expect from this issue now (as it is clear that sjcl.codec.utf8String.fromBits is correct after all) or close it :)

hiddentao commented 10 years ago

My point was simply that utf8String.fromBits should do what it says on the tin and give me a string from the raw bits. Of course I could use base64 but the above method would be quicker. Calling escape on the final string is problematic and I was simply asking why this was being done.

stbuehler commented 10 years ago

Example: You start with hex input "C3 A4", and convert it into a bit array. then utf8String.fromBits builds a string from this using String.fromCharCode(0xC3) + String.fromCharCode(0xA4) ("Ã¤"), runs escape on it ("%C3%A4") and then decodeURIComponent, leading to the final string "ä".

In other words: escape/unescapes treats the string as a "fixed-8-bit" encoding (doesn't matter which), and decodeURIComponent decodes into a utf-8 string.

The combination of those two is a nice way to decode utf-8 "bytes" into unicode codepoints; otherwise one would have to parse utf-8 manually.

escape/unescape are deprecated because they usually do not what a user actually wants; but in this case they do.

Update: decodeURIComponent(escape(String.fromCharCode(0xC3))) for example fails; but utf-8 decoding the hex string "C3" would fail too, because "C3" is just not a valid utf-8 sequence.

hiddentao commented 10 years ago

In that case I think it's worth adding some documentation (or atleast a link to this issue) next to that function so that it's still clear to new users. I can raise a PR if you agree.

stbuehler commented 10 years ago

How about adding a (user visible) comment that decoding can fail as not all bit/byte strings are valid utf-8 strings? One could link some UTF-8 specs perhaps.

One could also add an inline comment for the usage of escape.

I'm not a maintainer but I see nothing wrong raising a PR for that.

stbuehler commented 10 years ago

Ah, another thing that might help you: without the decodeURIComponent(escape(...)) conversion you get the ISO-8859-1 decoding (i.e. the function transforms an ISO-8859-1 encoded bit string into a javascript string), so the "fixed-8-bit unicode" encoding is actually ISO-8859-1.

bitwiseshiftleft / sjcl

sjcl.codec.utf8String.fromBits() fails for certain valid inputs #131