gavinking opened this issue 10 years ago
Hrm, this would require adding an algorithm to encode UTF-16->UTF-8 to the JS language module, since JS doesn't have this built in. Apparently the code would look something like this:
http://stackoverflow.com/questions/18729405/how-to-convert-utf8-string-to-byte-array
Perhaps this is an unacceptable bloat? Not sure. WDYT, @FroMage @chochos?
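For reference, the transcoding loop the linked answer describes is short. Here is a minimal sketch in Java (whose strings are also UTF-16 internally, so a JS port would follow the same branches); `utf8Encode` is a hypothetical name, and well-formed input with no unpaired surrogates is assumed:

```java
import java.io.ByteArrayOutputStream;

class Utf8EncodeSketch {
    // Hypothetical helper: UTF-16 string -> UTF-8 bytes, assuming no unpaired surrogates.
    static byte[] utf8Encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int cp = s.codePointAt(i);             // combines a surrogate pair into one code point
            if (Character.charCount(cp) == 2) i++; // skip the low surrogate we just consumed
            if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
                out.write(cp);
            } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                               // 4 bytes, for code points above the BMP
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```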
An attribute to get the UTF-8 encoding? And what if I want UTF-16 or ASCII or whatever other encoding? And I don't think using `Byte[1]|Byte[2]|Byte[4]` to go the other direction is very useful; a `Byte[1]` could still be one of a great number of ASCII encodings besides UTF-8. And even the UTF-16 and UTF-32 versions would often come in a simple `Byte[1]` buffer. (Which would maybe make it necessary to create functions that turn a `Byte[1]` into a `Byte[2]` or `Byte[4]` of half or quarter size.)
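If such widening functions were wanted, on the JVM they would be thin wrappers over `java.nio`. A sketch; `asUtf16Units` is a hypothetical name, and big-endian order is an assumption:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;

class RepackSketch {
    // Hypothetical helper: reinterpret a raw byte buffer as 16-bit UTF-16 code
    // units ("a Byte[1] into a Byte[2] of half size"). Assumes big-endian order
    // and an even number of bytes; a trailing odd byte is silently dropped.
    static char[] asUtf16Units(byte[] raw) {
        CharBuffer cb = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN).asCharBuffer();
        char[] units = new char[cb.remaining()];
        cb.get(units);
        return units;
    }
}
```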
@quintesse Look, this is a really simple thing. In Java we have `String.getBytes()` to turn a string into a byte array. We don't offer the same thing in Ceylon.
The Java `getBytes()` is an overload of `getBytes(encoding)`, defaulting to the platform's default charset!
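To illustrate the two overloads (the first line's output depends on the JVM's default charset, which is exactly the point):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GetBytesDemo {
    public static void main(String[] args) {
        String s = "é";
        // Platform default charset: the result varies between machines.
        System.out.println(Arrays.toString(s.getBytes()));
        // Explicit charset: always [-61, -87], i.e. 0xC3 0xA9, the UTF-8 encoding.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```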
But I understand we can't go around adding every possible encoding to the Ceylon language module; that's what we have the SDK for. But then I'd go simply for one format, in and out: either UTF-8 (which means adding some kind of coder/decoder to the JavaScript backend) or UTF-16 (which both backends handle and which is their internal representation).
Ah, in fact apparently I misunderstood something. In Java, `String.getBytes()` uses the platform's default encoding, not UTF-8. I agree that makes it significantly less useful.
It still seems to me that being able to encode a `String` to UTF-8 is so useful that we could privilege it like this.
@quintesse Huh? How would you represent a UTF-16 encoding as an array of bytes?
@gavinking Wdym? The encoding is always an array of bytes. That's why Java's `getBytes()` takes an "encoding" parameter: you pass it "UTF-16" and you get a byte array representing the String's contents encoded as UTF-16 (which in Java's case is its internal representation, so no conversion is done).
Oh OK. I dunno. Somehow that just seems weird to me to represent a UTF-16 encoding using a byte array. Surely an int array is the more natural representation? Makes sense for doing IO, of course, but at the language level?
Er, doesn't UTF-16 come in big- and little-endian flavours? And if we're not careful we'll get into BOMs. Eugh!
@tombentley Yes, which is why a byte array makes more sense. That way if you write it to a file you know everything will be in the correct order etc. Once you start dealing with integer arrays you're again in a world of hurt. Bytes are just the most "neutral" thing you can work with.
And BOMs only count for file formats, they're not taken into account when dealing with in-memory representations.
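To make the endianness point concrete: Java's explicit big- and little-endian charsets write no BOM and differ only in byte order:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EndianDemo {
    public static void main(String[] args) {
        // Same character, same bytes, opposite order:
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
    }
}
```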
Not for 1.1, since it seems there is no consensus on whether it would be a good thing.
We need to add `bytes` attributes to `String` and `Character` to get the UTF-8 encoding. We should also consider the problem of going in the other direction, `Byte[]` -> `Character`/`String`, perhaps changing the constructor of `Character` to accept `Byte[1]|Byte[2]|Byte[4]`.
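For the `Byte[]` -> `Character` direction the proposal mentions, the per-character decoding logic is also small. A sketch, assuming a single well-formed UTF-8 sequence (`decodeOne` is a hypothetical name):

```java
class Utf8DecodeSketch {
    // Hypothetical helper: decode one well-formed UTF-8 sequence (1-4 bytes)
    // to a Unicode code point. No validation of continuation bytes is done.
    static int decodeOne(byte[] b) {
        int b0 = b[0] & 0xFF;
        if (b0 < 0x80) return b0;                    // 1 byte: ASCII
        int len = b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4; // classify by the lead byte
        int cp = b0 & (0x3F >> (len - 1));           // payload bits of the lead byte
        for (int i = 1; i < len; i++)
            cp = (cp << 6) | (b[i] & 0x3F);          // 6 payload bits per continuation byte
        return cp;
    }
}
```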