DOCGroup / ACE_TAO

ACE and TAO
https://www.dre.vanderbilt.edu/~schmidt/TAO.html

ACE/TAO Wide Strings on Linux #2145

Open wkbrd opened 1 year ago

wkbrd commented 1 year ago

Discussed in https://github.com/DOCGroup/ACE_TAO/discussions/2144

Originally posted by **wkbrd** October 16, 2023

We are attempting to use wide strings (CORBA::WChar*) on Linux, but the encoding does not behave as expected for characters whose representations do not fit into a single UTF-16 code unit. On Linux, wchar_t is a 32-bit type. When a wide string comes over the wire, including from other ORBs, the default behavior is that each element of the string holds the next 16-bit unit of the UTF-16 representation, rather than the next character in the native UTF-32 layout. Is this intended?

For example, 🂡🂮🂭🂫🂪 is being marshaled as L"\xd83c\xdca1\xd83c\xdcae\xd83c\xdcad\xd83c\xdcab\xd83c\xdcaa"

A search of the source tree found references to -ORBNativeWcharCodeSet UCS-4 -ORBWcharCodesetTranslator WUCS4_UTF16_Factory in TAO/tests/CodeSets/simple/. Using -ORBNativeWcharCodeSet UCS-4 produces a string represented as an array of wchar_t twice the length of the UTF-32 representation, where each pair of elements holds the low 16 bits of the UTF-32 character followed by the high 16 bits.

For example, 🂡🂮🂭🂫🂪 is being marshaled as L"¡\001®\001¬\001«\001ª\001"

> (gdb) print /x wstrStreamName[0]
> $2 = 0xf0a1
> (gdb) print /x wstrStreamName[1]
> $3 = 0x1
> (gdb) print /x wstrStreamName[2]
> $4 = 0xf0ae
> (gdb) print /x wstrStreamName[3]
> $5 = 0x1
> (gdb) print /x wstrStreamName[4]
> $6 = 0xf0ad
> (gdb) print /x wstrStreamName[5]
> $7 = 0x1
> (gdb) print /x wstrStreamName[6]
> $8 = 0xf0ab
> (gdb) print /x wstrStreamName[7]
> $9 = 0x1
> (gdb) print /x wstrStreamName[8]
> $10 = 0xf0aa
> (gdb) print /x wstrStreamName[9]
> $11 = 0x1
> (gdb) print /x wstrStreamName[10]
> $12 = 0x0
> (gdb) print /x wstrStreamName[11]
> $13 = 0x0

Character reference: https://en.wikipedia.org/wiki/Playing_cards_in_Unicode
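To make the two observed layouts concrete, here is a minimal C++ sketch of how each could be converted back into UTF-32 code points on the application side. Both function names are hypothetical helpers written for illustration and are not part of ACE or TAO. The sketch assumes each 32-bit element carries a single 16-bit value, as in the dumps above, and it passes unpaired surrogates through unchanged rather than rejecting them.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an ACE/TAO API): combine UTF-16 code units,
// each stored in its own 32-bit element, into UTF-32 code points.
std::u32string utf16_units_to_utf32(const std::vector<std::uint32_t>& units)
{
  std::u32string out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    const std::uint32_t hi = units[i];
    const bool is_high = hi >= 0xD800 && hi <= 0xDBFF;
    if (is_high && i + 1 < units.size()) {
      const std::uint32_t lo = units[i + 1];
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Standard surrogate-pair arithmetic.
        out.push_back(static_cast<char32_t>(
            0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)));
        ++i;  // the low surrogate has been consumed
        continue;
      }
    }
    // BMP character, or an unpaired surrogate passed through as-is.
    out.push_back(static_cast<char32_t>(hi));
  }
  return out;
}

// Hypothetical helper: reassemble (low 16 bits, high 16 bits) element
// pairs, the layout seen in the gdb output, into UTF-32 code points.
// A trailing unpaired element is ignored.
std::u32string halves_to_utf32(const std::vector<std::uint32_t>& halves)
{
  std::u32string out;
  for (std::size_t i = 0; i + 1 < halves.size(); i += 2) {
    out.push_back(static_cast<char32_t>(
        (halves[i + 1] << 16) | (halves[i] & 0xFFFF)));
  }
  return out;
}
```

For instance, the units 0xd83c, 0xdca1 from the first dump and the pair 0xf0a1, 0x1 from the gdb session both reassemble to U+1F0A1 (🂡). This only illustrates the arithmetic; it is not a substitute for a codeset translator inside the ORB.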

Based on the discussion, we tried to use WUCS4_UTF16.cpp but encountered issues. The associated PR addresses them.

saper commented 11 months ago

Did you build ACE with uses_wchar?

On a very recent FreeBSD with LLVM 16 I get build failures in ACEXML with zzip, in ACEXML/common/ZipCharStream.cpp: ACEXML_Char is wchar_t, and the zip/zzip libraries return ordinary char values that are not compatible with ACEXML_Char.

(maybe this is totally unrelated to what you are doing)

mitza-oci commented 11 months ago

> (maybe this is totally unrelated to what you are doing)

It does seem to be unrelated, please open a new issue/discussion.