Problem with string constant encoding in Avian

bigfatbrowncat commented 11 years ago

As far as I understand, Avian converts strings into UTF-8. That leads to some problems. Let me show:

public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello! Привет!");
    }
}

The file is encoded in code page 866 (the old Cyrillic encoding used by the Russian Windows console).

Oracle JVM output in console:

Hello! Привет!

Avian output in console:

Hello! ╨П╨░╨Б╤Ю╥Р╨▓!

I think the root cause is that the Oracle JVM takes string constants as is and just writes them to the output, while Avian converts all constants from the current system encoding into UTF-8. This behaviour leads to a mess.

Let me show it: let's encode our source file in Windows-1251 (default Windows Cyrillic codepage) and run again.

Oracle JVM output in console:

Hello! ╧ЁштхЄ!

That strange output is correct! We loaded the string from a file encoded in Cp1251 and wrote it to a console that uses Cp866. This is the usual curse of Russian Windows developers: we use two different encodings for the Win32 UI and for console windows.

Avian output in console:

Hello! ╨Я╤А╨╕╨▓╨╡╤В!

Here we see a different result. What is it? That is easy to answer. Let's change the console window code page to UTF-8:

>chcp 65001

After that Avian output would be:

Hello! Привет!

If we change console encoding to Cp1251 with

>chcp 1251

Oracle JVM output would be correct:

Hello! Привет!

So just as I said before, Oracle JVM just copies text as is, but Avian converts it from Cp1251 to UTF-8.
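The garbled strings in the second experiment can be replayed on a standard JDK by encoding the text with one charset and decoding the same bytes with another. The sketch below is only an illustration: the class and method names are made up, and it assumes the JDK charset names "IBM866" and "windows-1251" for Cp866 and Cp1251. The literal is written with Unicode escapes so the example does not depend on how its own source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text with one charset, then decode the same bytes with another.
    // This models a console whose code page differs from what the writer used.
    static String reinterpret(String text, Charset writtenAs, Charset shownAs) {
        return new String(text.getBytes(writtenAs), shownAs);
    }

    public static void main(String[] args) {
        Charset cp866 = Charset.forName("IBM866");        // assumed JDK name for Cp866
        Charset cp1251 = Charset.forName("windows-1251"); // assumed JDK name for Cp1251
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442"; // the Russian word from the report

        // UTF-8 bytes shown in a Cp866 console (the Avian case).
        System.out.println(reinterpret(text, StandardCharsets.UTF_8, cp866));
        // Cp1251 bytes shown in a Cp866 console (the Oracle JVM case).
        System.out.println(reinterpret(text, cp1251, cp866));
        // Cp1251 bytes shown in a Cp1251 console (after chcp 1251).
        System.out.println(reinterpret(text, cp1251, cp1251));
    }
}
```

What the program prints still depends on the terminal it runs in; the returned strings are what matter. The first two calls yield the same character sequences as the garbled Avian and Oracle outputs above, and the third yields the original word.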

Could you change this behavior to improve compatibility between VMs?

In addition, Avian's default classpath doesn't offer a simple way to change the I/O encoding. None of the methods recommended here work for Avian: http://stackoverflow.com/questions/2415597/java-how-to-detect-and-change-encoding-of-system-console

There is no System.console() by default
You can't write PrintStream out = new PrintStream(System.out, true, "UTF-8"); because there is no such constructor for PrintStream.
As far as I see, Avian doesn't support -Dfile.encoding (maybe I'm wrong)

If you could implement encoding conversion for PrintStream, that would be a great present...
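For reference, this is roughly what the missing PrintStream constructor does on a standard JDK. It is a sketch of the desired behaviour, not something that runs on Avian's default classpath; the class name, the helper withEncoding, and the choice of "IBM866" (assumed here to be the JDK name for Cp866) are illustrative.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncoding {
    // Wrap a stream so that print/println encode text with the named charset.
    static PrintStream withEncoding(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Pick the charset that matches the active console code page (see chcp).
        PrintStream out = withEncoding(System.out, "IBM866");
        out.println("Hello! \u041f\u0440\u0438\u0432\u0435\u0442!");
    }
}
```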

joshuawarner32 commented 11 years ago

Keep in mind that the javac compiler normalizes all strings to UTF-8 (that's how they're encoded inside the class files). Avian maintains either a UTF-8 or UTF-16 internal encoding for strings (at runtime), and hotspot maintains a strict UTF-16 encoding - so any variance you see is from the string being reencoded for output, not Avian or Hotspot maintaining the original encoding in the source.

Of those three options, the second one makes the most sense to me. If you don't mind doing a little legwork, I can certainly help out. It will be a bit of a beast, though... character re-encoding isn't terribly fun.

You might also look at building avian with the openjdk classpath. It will be substantially larger - but it should "just work".

bigfatbrowncat commented 11 years ago

so any variance you see is from the string being reencoded for output, not Avian or Hotspot maintaining the original encoding in the source.

If I use Hotspot it outputs the strings just the same as they were in the source .java file. But Avian changes the encoding of the string. I don't know how it works inside, but as far as I understand it should be something like this:

Cp1251 -> UTF-8 -> Cp1251

You said that the first part is what javac does. OK, so be it. But the second part isn't handled well by Avian: it tries to output a UTF-8-encoded string to the default console window, which isn't UTF-8 on Windows.

Am I right?

joshuawarner32 commented 11 years ago

The class files don't maintain any information about how the source file was encoded. If the output encoding happens to line up with the input encoding, that's just a coincidence. You're correct that the problem is that avian is (probably) just writing utf-8 to the terminal, rather than detecting the native encoding of the terminal and using that.
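The "coincidence" can be modelled as four independent charset choices: the one the source file was saved in, the one javac assumes when decoding it, the one the VM uses when writing, and the one the console uses to display bytes. The sketch below is a hypothetical model run on a standard JDK. It assumes javac fell back to Cp1251 as the platform default and that "IBM866" and "windows-1251" are the JDK names for Cp866 and Cp1251; all class, method, and parameter names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingPipeline {
    // Model of the four stages a string literal passes through on its way to the screen.
    static String seenOnConsole(String literal, Charset sourceFile, Charset javacAssumes,
                                Charset runtimeWrites, Charset consoleShows) {
        byte[] onDisk = literal.getBytes(sourceFile);           // editor saves the .java file
        String inClassFile = new String(onDisk, javacAssumes);  // javac decodes; only characters survive
        byte[] toConsole = inClassFile.getBytes(runtimeWrites); // VM encodes the characters for output
        return new String(toConsole, consoleShows);             // console interprets the bytes
    }

    public static void main(String[] args) {
        Charset cp866 = Charset.forName("IBM866");
        Charset cp1251 = Charset.forName("windows-1251");
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Source saved as Cp866, javac assumes Cp1251, VM writes Cp1251, console is Cp866:
        // the two wrong guesses cancel out and the text survives.
        System.out.println(seenOnConsole(text, cp866, cp1251, cp1251, cp866).equals(text));

        // Same source, but the VM writes UTF-8: the text no longer round-trips.
        System.out.println(seenOnConsole(text, cp866, cp1251, StandardCharsets.UTF_8, cp866).equals(text));
    }
}
```

Under these assumptions the first call returns the original word and the second returns the same garbled sequence as the first Avian output in the report, which is consistent with the VM writing UTF-8 rather than the console's code page.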

On windows, at least, there may be APIs that do the conversion for us - but I'm hesitant to use those as they're not cross-platform (and I'm not sure if equivalents actually exist for other platforms). The other option of course is using a pre-built cross-platform library like ICU - but that thing itself is a beast. Lastly, we could write conversion routines ourselves, but that's going to be painful and error-prone.
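To give a sense of what a hand-written conversion routine involves, here is a minimal sketch of an encoder for a single code page, Cp866, covering only ASCII and the Russian alphabet. It is not Avian code: the class and method names are hypothetical, the box-drawing and other Cp866 characters are deliberately left out, and every code page would need its own table.

```java
public class Cp866Encoder {
    // Map one UTF-16 code unit to its Cp866 byte value.
    // Anything outside ASCII and the Russian alphabet becomes '?'.
    static int encodeChar(char c) {
        if (c < 0x80) return c;                                    // ASCII is unchanged
        if (c >= 0x0410 && c <= 0x042F) return c - 0x0410 + 0x80;  // capital A..YA
        if (c >= 0x0430 && c <= 0x043F) return c - 0x0430 + 0xA0;  // small a..pe
        if (c >= 0x0440 && c <= 0x044F) return c - 0x0440 + 0xE0;  // small er..ya
        if (c == 0x0401) return 0xF0;                              // capital IO
        if (c == 0x0451) return 0xF1;                              // small io
        return '?';
    }

    // Encode a whole string; each surrogate half also becomes '?'.
    static byte[] encode(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) encodeChar(s.charAt(i));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = encode("Hello! \u041f\u0440\u0438\u0432\u0435\u0442!\n");
        // Write raw bytes so no further charset conversion is applied.
        System.out.write(bytes, 0, bytes.length);
        System.out.flush();
    }
}
```

Even this small table has three separate ranges for one alphabet, which hints at why doing this by hand for arbitrary code pages is error-prone.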

In general, I'd say adding support for non-unicode (i.e. anything but UTF-8, -16 or -32, really) encodings is something of a fool's errand. Such encodings are going the way of the Dodo.

That said, if you're interested in putting in the effort, I'm game.

joshuawarner32 commented 10 years ago

@bigfatbrowncat, would you mind if I close this?

duanyao commented 10 years ago

@joshuawarner32 Maybe the wcstombs() function can do the conversion; it is supported on *nix, Windows, iOS, and Android (>= 2.3). http://www.cplusplus.com/reference/cstdlib/wcstombs/ http://www.cplusplus.com/reference/clocale/ http://msdn.microsoft.com/en-us/library/vstudio/5d7tc9zw.aspx https://developer.apple.com/library/ios/DOCUMENTATION/System/Conceptual/ManPages_iPhoneOS/man3/wcstombs.3.html http://www.kandroid.org/ndk/docs/system/libc/CHANGES.html

By the way, iOS and Android use only UTF-8 as the system encoding, so the conversion is actually not needed on them.

joshuawarner32 commented 10 years ago

@duanyao, true, but that doesn't give us much; that only deals with unicode formats (which aren't that difficult to encode anyway), not arbitrary code pages. Also keep in mind that wchar_t is highly platform dependent - it's typically either 2 or 4 bytes, and it's also only defined to be bigger than char. Indeed, there's no defined encoding for wchar_t strings (that I'm aware of) - so it really could be anything.

duanyao commented 10 years ago

@joshuawarner32 In theory, you are right, wchar_t can be in any encoding; but in practice, wchar_t is implemented as either UTF-16 (Windows) or UTF-32 (Unix-like systems). I haven't heard of any modern OS that is an exception here. So you can determine wchar_t's encoding simply from sizeof(wchar_t). PS: some embedded systems that can't afford UTF-16 or UTF-32 may define wchar_t as char and assume ASCII encoding (the old Android NDK does this); however, this is easy to handle.

joshuawarner32 commented 10 years ago

If we make that assumption, then it's only useful to us on windows anyway, because Avian's internal strings are (mostly...) UTF-16, to be compatible with Java. I wouldn't be too surprised to find some system that implemented it as UCS-2 (i.e. the old UTF-16). That would cause some subtle and rare bugs.

More importantly, it doesn't help us with pick-your-favorite-code-page, as @bigfatbrowncat wants.

duanyao commented 10 years ago

If I understand correctly, @bigfatbrowncat wants to use the system default MBS encoding, which can be selected with setlocale(LC_ALL,"") http://www.cplusplus.com/reference/clocale/setlocale/ .

UTF-16 to UTF-32 conversion should be easy. UCS-2 is the same as UTF-16 except that it doesn't support code points that can't be represented in 2 bytes. As I said above, none of the modern Unix-like OSes or Windows use UCS-2, so it should not be a big problem.
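As a rough illustration of that conversion, here is a sketch in Java, where strings are already sequences of UTF-16 code units. The class and method names are made up. The only part that goes beyond UCS-2 is the surrogate-pair branch; unpaired surrogates are passed through unchanged here, which is one of several possible policies.

```java
import java.util.Arrays;

public class Utf16ToUtf32 {
    // Convert UTF-16 code units to UTF-32 code points.
    static int[] toCodePoints(char[] units) {
        int[] out = new int[units.length];
        int n = 0;
        for (int i = 0; i < units.length; i++) {
            char hi = units[i];
            boolean isPair = hi >= 0xD800 && hi <= 0xDBFF
                    && i + 1 < units.length
                    && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
            if (isPair) {
                char lo = units[++i];
                // Combine the two 10-bit halves and add the supplementary-plane offset.
                out[n++] = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            } else {
                out[n++] = hi; // BMP character, or an unpaired surrogate passed through
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        // Six Cyrillic letters followed by one character outside the BMP (U+1F600).
        String s = "\u041f\u0440\u0438\u0432\u0435\u0442\uD83D\uDE00";
        for (int cp : toCodePoints(s.toCharArray())) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

A UCS-2 reader would treat the last two code units as two separate characters, which is the kind of subtle and rare bug mentioned earlier in the thread.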

joshuawarner32 commented 10 years ago

Ah, I see. setlocale could work - but it might be hard to get "right", given the non-local nature of the operation. I'd be happy to review/accept patches in this direction - but I'm personally rather disenthused with old string encodings.

UTF-16 to UTF-32 conversion should be easy.

And inefficient. Not that avian really competes in the big leagues on performance, but still...

As I said above, none of the modern Unix-like OSes or Windows use UCS-2, so it should not be a big problem.

Sounds like you know more about it than me, so I'll defer to you here.

duanyao commented 10 years ago

Yes, if some client JNI code calls setlocale with different parameters, we are out of luck. But I think most developers will not do that, because setting a non-default locale is asking for trouble.

If you think the 16-to-32 conversion is slow, you may omit it. After all, most Unix-like systems use UTF-8 as the default MBS encoding, though the user may change it.

I also hate that Windows doesn't use UTF-8 as its default MBS encoding. However, this will not change in the near future, so we'd better do something, even if it's not perfect.

duanyao commented 10 years ago

I think I can take some time to fix this bug in the next few weeks.

ReadyTalk / avian

Problem with string constant encoding in Avian #80