ceylon / ceylon.language

DEPRECATED
Apache License 2.0

Add String.bytes + Character.bytes and possibly change Character constructor #519

Open gavinking opened 10 years ago

gavinking commented 10 years ago

We need to add bytes attributes to String and Character to get the UTF-8 encoding. We should also consider the problem of going in the other direction, Byte[] -> Character/String, perhaps changing the constructor of Character to accept Byte[1]|Byte[2]|Byte[4].

gavinking commented 10 years ago

Hrm, this would require adding an algorithm to encode UTF-16->UTF-8 to the JS language module, since JS doesn't have this built in. Apparently the code would look something like this:

http://stackoverflow.com/questions/18729405/how-to-convert-utf8-string-to-byte-array

Perhaps this is an unacceptable bloat? Not sure. WDYT, @FroMage @chochos?
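
For a sense of how much code is at stake, here is a minimal sketch of a hand-written UTF-16 to UTF-8 encoder. It is written in Java rather than JavaScript, and it is not the code from the linked answer; the class name, the method name and the choice to replace unpaired surrogates with U+FFFD are all assumptions made for illustration. The same bit manipulation should port to JavaScript more or less line for line.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class Utf16ToUtf8Sketch {

    /** Encodes the UTF-16 code units of s as UTF-8 bytes. */
    static byte[] encodeUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char unit = s.charAt(i);
            int cp = unit;
            if (Character.isHighSurrogate(unit) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Combine a surrogate pair into a single code point.
                cp = Character.toCodePoint(unit, s.charAt(++i));
            } else if (Character.isSurrogate(unit)) {
                // Unpaired surrogate: substitute U+FFFD (an assumed policy).
                cp = 0xFFFD;
            }
            if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
                out.write(cp);
            } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                         // 4 bytes: 11110xxx then three 10xxxxxx
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // One character from each UTF-8 length class: 1, 2, 3 and 4 bytes.
        String sample = "a\u00E9\u20AC\uD83D\uDE00";
        byte[] expected = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(encodeUtf8(sample), expected)); // prints "true"
    }
}
```

The whole encoder is a single loop of roughly thirty lines, which may help in judging the bloat question.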

quintesse commented 10 years ago

An attribute to get UTF-8 encoding? And what if I want UTF-16 or ASCII or whatever other encoding? And I don't think using Byte[1]|Byte[2]|Byte[4] to go the other direction is very useful: Byte[1] could still be one of a great number of single-byte encodings besides UTF-8. And even the UTF-16 and UTF-32 versions would often come in a simple Byte[1] buffer. (Which would maybe make it necessary to create functions that turn a Byte[1] into a Byte[2] or Byte[4] of half or quarter size.)
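
The ambiguity of a lone byte can be sketched in Java. The class and the `decode` helper below are hypothetical names; the only library call involved is the standard `new String(byte[], Charset)` constructor. The same byte 0xE9 yields a different character depending on which encoding the caller assumes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class DecodeAmbiguitySketch {

    /** Decodes bytes under an explicitly chosen charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] one = { (byte) 0xE9 };

        // As ISO-8859-1, 0xE9 is U+00E9 (e with acute accent).
        String latin1 = decode(one, StandardCharsets.ISO_8859_1);
        System.out.println(Integer.toHexString(latin1.codePointAt(0))); // prints "e9"

        // As UTF-8, a lone 0xE9 is a truncated sequence, so the decoder
        // substitutes the replacement character U+FFFD.
        String utf8 = decode(one, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(utf8.codePointAt(0)));   // prints "fffd"
    }
}
```

The length of the byte sequence alone therefore does not identify the encoding; the caller has to supply it.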

gavinking commented 10 years ago

@quintesse Look, this is a really simple thing. In Java we have String.getBytes() to turn a string into a byte array. We don't offer the same thing in Ceylon.

quintesse commented 10 years ago

The Java getBytes() is an overload of getBytes(encoding) that defaults to the platform's default charset!

quintesse commented 10 years ago

But I understand we can't go around adding every possible encoding to the Ceylon language module; that's what we have the SDK for. In that case I'd simply go for one format, in and out: either UTF-8 (which means adding some kind of coder/decoder to JavaScript) or UTF-16 (which both backends handle and which is their internal representation).

gavinking commented 10 years ago

Ah, in fact apparently I misunderstood something. In Java, String.getBytes() uses the platform's default encoding, not UTF-8. I agree that this is something that is significantly less useful.

It still seems to me like being able to encode a String to UTF8 is something that's so useful that we could privilege it like this.
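
The difference between the two Java calls can be sketched as follows. The class and method names are hypothetical; `String.getBytes()`, `String.getBytes(Charset)` and `Charset.defaultCharset()` are standard JDK calls. The result of the no-argument form depends on the platform configuration, so no expected output is given for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class GetBytesSketch {

    /** Always encodes as UTF-8, independent of platform settings. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "h\u00E9"; // 'h' followed by e with acute accent

        // Platform dependent: the charset, and hence the byte count, may vary.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default length:  " + s.getBytes().length);

        // Platform independent: 1 byte for 'h' plus 2 bytes for U+00E9.
        System.out.println("UTF-8 length:    " + utf8(s).length);

        // Decoding with the same charset restores the original string.
        System.out.println(new String(utf8(s), StandardCharsets.UTF_8).equals(s)); // prints "true"
    }
}
```

A bytes attribute fixed to UTF-8 would behave like the second form, which is the deterministic one.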

gavinking commented 10 years ago

@quintesse Huh? How would you represent a UTF-16 encoding as an array of bytes?

quintesse commented 10 years ago

@gavinking Wdym? The encoding is always an array of bytes. That's why Java's getBytes() takes an "encoding" parameter, you pass it "UTF-16" and you get a byte array representing the String's contents encoded as UTF-16 (which in case of Java is its internal representation so no conversion is done in that case).

gavinking commented 10 years ago

Oh OK. I dunno. Somehow that just seems weird to me to represent a UTF-16 encoding using a byte array. Surely an int array is the more natural representation? Makes sense for doing IO, of course, but at the language level?

tombentley commented 10 years ago

Er, doesn't UTF-16 come in big and little endian flavours? And if we're not careful we'll get into BOMs. Eugh!

quintesse commented 10 years ago

@tombentley Yes, which is why a byte array makes more sense. That way if you write it to a file you know everything will be in the correct order etc. Once you start dealing with integer arrays you're again in a world of hurt. Bytes are just the most "neutral" thing you can work with.

And BOMs only count for file formats, they're not taken into account when dealing with in-memory representations.
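
The byte-order question can be made concrete with the JDK's three standard UTF-16 charsets. This is a small sketch with hypothetical class and helper names. When encoding, Java's plain UTF-16 charset writes a big-endian byte-order mark first, while the BE and LE variants write none.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Utf16FlavoursSketch {

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A"; // U+0041

        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // prints "FE FF 00 41"
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // prints "00 41"
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // prints "41 00"
    }
}
```

Any UTF-16 byte-array API would have to pick one of these three conventions, whereas UTF-8 has no byte-order variants to choose between.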

gavinking commented 10 years ago

Not for 1.1, since it seems there is no consensus on whether it would be a good thing.