Open hiddentao opened 11 years ago
My guess: to protect developers when transferring such UTF-8 data in, e.g., a URL parameter. But for the correct answer you will have to wait for @bitwiseshiftleft.
Not many random bit strings are valid utf-8. The error message could be improved though...
In the end I decided to write a version of sjcl.codec.utf8String.fromBits which didn't call escape/encodeURIComponent.
Such an interpretation is a "fixed-8-bit encoding" of Unicode, which obviously can only represent Unicode code points below 0x100. I see no value in such an encoding.
If you need to encode random binary data as some string I'd recommend base64; your version works too ofc (just that there is no "name" for such encoding afaik).
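For reference, a minimal sketch of the base64 route for arbitrary bytes. It assumes an environment with the global btoa/atob functions (browsers, recent Node.js); the variable names are only illustrative, and inside SJCL one would more likely use its own base64 codec.

```javascript
// Arbitrary binary data, including bytes that are not valid UTF-8.
const rawBytes = [0x00, 0xC3, 0xFF, 0x10];

// btoa() expects a "binary string": one character per byte (0x00..0xFF).
const b64 = btoa(String.fromCharCode(...rawBytes));

// atob() reverses the step; map each character back to its byte value.
const rawBytesBack = Array.from(atob(b64), (ch) => ch.charCodeAt(0));
```

Unlike UTF-8 decoding, this cannot fail for any byte sequence, at the cost of a roughly 33% longer string.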
It'd be nice if you could either clarify what you expect from this issue now (as it is clear that sjcl.codec.utf8String.fromBits is correct after all) or close it :)
My point was simply that utf8String.fromBits should do what it says on the tin and give me a string from the raw bits. Of course I could use base64 but the above method would be quicker. Calling escape on the final string is problematic and I was simply asking why this was being done.
Example: You start with hex input "C3 A4", and convert it into a bit array. Then utf8String.fromBits builds a string from this using String.fromCharCode(0xC3) + String.fromCharCode(0xA4) ("Ã¤"), runs escape on it ("%C3%A4") and then decodeURIComponent, leading to the final string "ä".
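The three steps can be reproduced outside SJCL with plain JavaScript. This is only a sketch of the conversion described here, working on a plain byte array rather than an SJCL bit array; the variable names are made up for illustration.

```javascript
// The two bytes of the UTF-8 encoding of "ä" (U+00E4).
const utf8Bytes = [0xC3, 0xA4];

// Step 1: build a string with one character per byte.
const byteString = String.fromCharCode(...utf8Bytes); // "Ã¤"

// Step 2: escape() percent-encodes each character above 0x7F as %XX.
const escaped = escape(byteString); // "%C3%A4"

// Step 3: decodeURIComponent() reads the %XX sequence as UTF-8.
const decoded = decodeURIComponent(escaped); // "ä"
```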
In other words: escape/unescape treat the string as a "fixed-8-bit" encoding (doesn't matter which), and decodeURIComponent decodes the resulting bytes as UTF-8.
The combination of those two is a nice way to decode utf-8 "bytes" into unicode codepoints; otherwise one would have to parse utf-8 manually.
escape/unescape are deprecated because they usually do not do what a user actually wants; but in this case they do.
Update: decodeURIComponent(escape(String.fromCharCode(0xC3))) for example fails; but UTF-8 decoding the hex string "C3" would fail too, because "C3" is just not a valid UTF-8 sequence.
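To make the failure mode concrete, here is a small sketch that wraps the same conversion in a try/catch. The helper name tryDecodeUtf8 is hypothetical and not part of SJCL; it takes a plain byte array.

```javascript
// Hypothetical helper: decode bytes as UTF-8 and report failure
// instead of letting the exception propagate.
function tryDecodeUtf8(byteValues) {
  const byteString = String.fromCharCode(...byteValues);
  try {
    return { ok: true, value: decodeURIComponent(escape(byteString)) };
  } catch (e) {
    // decodeURIComponent throws a URIError on malformed UTF-8.
    return { ok: false, error: e.name };
  }
}

const loneLeadByte = tryDecodeUtf8([0xC3]);       // lead byte without continuation
const completePair = tryDecodeUtf8([0xC3, 0xA4]); // valid two-byte sequence
```

The lone 0xC3 is rejected while the complete pair decodes, which is why random bytes will almost always trigger the error.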
In that case I think it's worth adding some documentation (or at least a link to this issue) next to that function so that it's still clear to new users. I can raise a PR if you agree.
How about adding a (user visible) comment that decoding can fail as not all bit/byte strings are valid utf-8 strings? One could link some UTF-8 specs perhaps.
One could also add an inline comment for the usage of escape.
I'm not a maintainer but I see nothing wrong with raising a PR for that.
Ah, another thing that might help you: without the decodeURIComponent(escape(...)) conversion you get the ISO-8859-1 decoding (i.e. the function transforms an ISO-8859-1 encoded bit string into a JavaScript string), so the "fixed-8-bit unicode" encoding is actually ISO-8859-1.
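A rough sketch of what such an ISO-8859-1 style conversion could look like on a plain byte array. The function names latin1FromBytes and bytesFromLatin1 are hypothetical, and this is not the version written earlier in the thread, just an illustration of the idea.

```javascript
// Hypothetical helpers: map each byte to the code point with the same
// value (ISO-8859-1), with no UTF-8 decoding step in between.
function latin1FromBytes(byteValues) {
  let out = "";
  for (const b of byteValues) {
    out += String.fromCharCode(b & 0xFF);
  }
  return out;
}

function bytesFromLatin1(str) {
  return Array.from(str, (ch) => ch.charCodeAt(0));
}

// Every possible byte value survives the round trip, so this never throws.
const allByteValues = Array.from({ length: 256 }, (_, i) => i);
const roundTripped = bytesFromLatin1(latin1FromBytes(allByteValues));
```

The trade-off is that the result is not the UTF-8 interpretation of the bytes: [0xC3, 0xA4] comes out as "Ã¤" rather than "ä".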
I just generated 512 bytes using sjcl.random to get the following array:
If I pass this to sjcl.codec.utf8String.fromBits I get the following error:
The escape and unescape methods are deprecated in JS. Actually, why are we encoding/decoding for URI use in the first place? If needed why not split these out into separate methods?