Unicode in predicates not working on Windows 10 platform

ghost commented 7 years ago

I just got a new Windows 10 machine and was testing some of the Unicode functionality of SWI-Prolog, but it doesn't seem to work. Is there a special option or setting to make it work?

Here are some example issues:

1) atom_codes/2 doesn't throw an error; it silently takes each code modulo 65536. Current observable behaviour:

Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 7.3.31)
Copyright (c) 1990-2016 University of Amsterdam, VU Amsterdam

?- atom_codes(X, [66574, 120173]).
X = 'Ў항'.
?- atom_codes(X, [66574, 120173]), atom_codes(X, Y).
X = 'Ў항',
Y = [1038, 54637].

Expected behaviour (one Deseret character and one mathematical symbol), with no modulo taken:

?- atom_codes(X, [66574, 120173]).
X = '𐐎𝕭'
?- atom_codes(X, [66574, 120173]), atom_codes(X, Y).
X = '𐐎𝕭',
Y = [66574,120173]
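The numbers in the first transcript are consistent with each code being truncated to 16 bits. The following is only a sketch of that arithmetic, not of SWI-Prolog's actual code path; the helper name `truncate_to_16_bits` is hypothetical.

```python
def truncate_to_16_bits(code_point: int) -> int:
    """Keep only the low 16 bits, i.e. take the code modulo 65536."""
    return code_point & 0xFFFF


# Codes from the report: U+1040E (Deseret) and U+1D56D (mathematical symbol).
codes = [66574, 120173]
truncated = [truncate_to_16_bits(c) for c in codes]

print(truncated)                       # [1038, 54637]
print("".join(map(chr, truncated)))    # the two BMP characters seen in the report
```

Both results land in the Basic Multilingual Plane (U+040E and U+D56D), which matches the `Y = [1038, 54637]` answer above.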

2) The top level does not take the modulo, but throws an exception. Current observable behaviour:

?- X = '\x1040E\'.
ERROR: Syntax error: Illegal character code
ERROR: X = '\
ERROR: ** here **
ERROR: x1040E\' .

Expected behaviour (one Deseret Character):

?- X = '\x1040E\'.
X = '𐐎'

3) char_code/2 does not take the modulo, but throws an exception. Current observable behaviour:

?- char_code(X, 120173).
ERROR: char_code/2: Cannot represent due to `character_code'

Expected behaviour (one Mathematical Symbol):

?- char_code(X, 120173).
X = '𝕭'

JanWielemaker commented 7 years ago

> Is there a special option or setting to make it work?

Insert a Linux USB stick into your computer. The Windows version is broken as it basically handles wchar_t as UCS-2, while in fact it is UTF-16. Using UTF-16 kind of bypasses the whole idea of using wchar_t in the first place: addressing characters as elements of an array.
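To illustrate why UTF-16 breaks array-style addressing, here is a small sketch that lists the 16-bit units Python's standard `utf-16-le` codec produces for the two characters from the report. The helper name `utf16_units` is hypothetical and this is not SWI-Prolog code.

```python
def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text encoded as UTF-16 (little endian)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


deseret = chr(0x1040E)   # code 66574 from the report
fraktur = chr(0x1D56D)   # code 120173 from the report

units = utf16_units(deseret + fraktur)
print([hex(u) for u in units])   # ['0xd801', '0xdc0e', '0xd835', '0xdd6d']
print(len(units))                # 4 units for 2 characters
```

With 16-bit wchar_t, element `i` of the array is therefore not necessarily character `i`, which is the mismatch described above.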

There are two ways out. One is to abandon Windows wchar_t, replace all wcs routines with our own that use UCS-4 and support UTF-16 I/O conversion. The other is to migrate all internals to UTF-8 and adjust all code for dealing with the multibyte consequences thereof. It is not likely that any of this is going to happen any time soon unless some external developer or commercial party jumps in.
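As a rough sketch of what the first option's I/O conversion involves, the functions below convert between full code points (what a UCS-4 array would hold) and UTF-16 code units using the standard surrogate-pair formula. The names `encode_utf16` and `decode_utf16` are hypothetical; nothing here is taken from the SWI-Prolog sources.

```python
def encode_utf16(code_points: list[int]) -> list[int]:
    """Convert code points (one per character) to UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp > 0x10FFFF:
            raise ValueError(f"not a Unicode code point: {cp:#x}")
        if cp >= 0x10000:
            # Outside the BMP: split the 20-bit offset into a surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
        else:
            units.append(cp)
    return units


def decode_utf16(units: list[int]) -> list[int]:
    """Convert UTF-16 code units back to code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: recombine.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through unchanged.
            code_points.append(unit)
            i += 1
    return code_points


original = [66574, 120173]
round_trip = decode_utf16(encode_utf16(original))
print(round_trip)   # [66574, 120173]
```

Internally every character then occupies exactly one array slot, and the variable-width form only appears at the I/O boundary.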

JanWielemaker commented 7 years ago

Added some tests. That indeed cannot hurt. Now using a type error for codes > 0x10ffff and a representation error for codes > 0xffff on Windows.
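The rule stated here can be sketched as a small classifier. This is only an illustration of the two thresholds; the function name `classify_char_code` and the `wide_char_is_16_bit` flag are hypothetical and do not reflect how SWI-Prolog implements the check.

```python
MAX_UNICODE = 0x10FFFF   # largest valid Unicode code point
MAX_16_BIT = 0xFFFF      # largest value that fits one 16-bit wchar_t


def classify_char_code(code: int, wide_char_is_16_bit: bool) -> str:
    """Return which outcome the stated rule gives for a character code."""
    if code > MAX_UNICODE:
        return "type_error"
    if wide_char_is_16_bit and code > MAX_16_BIT:
        return "representation_error"
    return "ok"


# 120173 is the code from the char_code/2 example in the report.
print(classify_char_code(120173, wide_char_is_16_bit=True))     # representation_error
print(classify_char_code(120173, wide_char_is_16_bit=False))    # ok
print(classify_char_code(0x110000, wide_char_is_16_bit=False))  # type_error
```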

JanWielemaker commented 2 years ago

This issue has been mentioned on SWI-Prolog. There might be relevant details there:

https://swi-prolog.discourse.group/t/windows-unit-test-fails-for-semweb-ntriples/5197/1

SWI-Prolog / swipl-devel

Unicode in predicates not working on Windows 10 platform #178