Closed roytam1 closed 11 years ago
Do you know where to find mapping tables?
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-930_P120-1999.ucm?txt
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-933_P110-1995.ucm?txt
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-935_P110-1999.ucm?txt
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-937_P110-1999.ucm?txt
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-939_P120-1999.ucm?txt
EBCDIC DBCS encodings use SO for changing state to DBCS mode and SI for changing state to SBCS mode.
Is there a "standard" SBCS mapping?
And where to find test data?
The SBCS part will normally use ibm-37: http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-37_P100-1995.ucm?txt (comment 2 is updated)
I made some test data here: http://roy.orz.hm/soft/ebcdic-test/ (the *.utf8.txt files are the UTF-8 output)
Should all control characters be converted and preserved, except for SI/SO?
All characters should be converted. The exception is when the codepage is DBCS and SO/SI appear as a pair; in that case the SO/SI pair itself is removed.
IBM-930 is not a superset of IBM-37. Does every MBCS have its own SBCS definition?
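To make the SO/SI rule concrete, here is a minimal sketch in C of the state machine a decoder might use. It is not bsdconv's actual code: the names ebcdic_unit and ebcdic_split are hypothetical, and it only splits the byte stream into single-byte and double-byte code units (the lookup in the .ucm tables is left out). It assumes SO is 0x0E and SI is 0x0F, that every SO in SBCS mode enters DBCS mode, and that an SI seen while already in SBCS mode is passed through like any other control character.

```c
#include <stddef.h>

#define EBCDIC_SO 0x0E  /* shift out: switch to DBCS mode */
#define EBCDIC_SI 0x0F  /* shift in: switch back to SBCS mode */

/* Hypothetical output unit: one code to look up in the SBCS or DBCS table. */
struct ebcdic_unit {
    int is_dbcs;    /* 1 if code is a two-byte DBCS value, 0 if a single byte */
    unsigned code;  /* 0xNN for SBCS, 0xHHLL for DBCS */
};

/* Split a stateful EBCDIC byte stream into code units, dropping the SO/SI
 * shift bytes themselves. Returns the number of units written, or -1 if the
 * output buffer is too small or a DBCS pair is truncated. */
static long ebcdic_split(const unsigned char *in, size_t n,
                         struct ebcdic_unit *out, size_t cap)
{
    size_t i = 0, count = 0;
    int dbcs = 0;  /* decoding starts in SBCS mode */

    while (i < n) {
        unsigned char b = in[i];

        if (!dbcs && b == EBCDIC_SO) { dbcs = 1; i++; continue; }
        if (dbcs && b == EBCDIC_SI)  { dbcs = 0; i++; continue; }

        if (count == cap)
            return -1;              /* output buffer full */

        if (dbcs) {
            if (i + 1 >= n)
                return -1;          /* second byte of the pair is missing */
            out[count].is_dbcs = 1;
            out[count].code = ((unsigned)b << 8) | in[i + 1];
            i += 2;
        } else {
            out[count].is_dbcs = 0; /* includes controls and a lone SI */
            out[count].code = b;
            i++;
        }
        count++;
    }
    return (long)count;
}
```

For example, the bytes 81 0E 4C 41 0F 82 would come out as three units: SBCS 0x81, DBCS 0x4C41, SBCS 0x82, with the SO/SI pair gone.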
IBM-930 has a weird secondary charmap. Do you know why, or what it is?
Yes, it varies. IBM-939 should have the same SBCS encoding as IBM-37.
Decoders are done. Are encoders wanted?
Yes, it would be nice to have. :-)
done
Please test it. If everything is fine, I'll bump a new release.
It looks OK from my preliminary testing, but I found that bsdconv removes \r from \r\n. Command line:
bsdconv utf-8:ibm-937 b5.utf8.txt > b5.ibm937.txt
A bit off topic, but is it possible to keep only the minimum set of files for a limited, specified function? For example, if I only want UTF-8 <--> EBCDIC conversion, which files should I keep? Or could the files be embedded completely into libbsdconv.dll?
bsdconv doesn't remove \r from \r\n, unless you use utf-8:unix:ibm-937
$ perl -e 'print "a\r\nb"' | bsdconv utf-8:ibm-937|hexdump -C
00000000  81 0d 25 82                                       |..%.|
00000004
Might it be done by shell redirection? Could you give me b5.ibm937.txt to verify the problem?
Bsdconv itself uses: from/ASCII from/PASS from/PASS.dll inter/FROM_ALIAS inter/FROM_ALIAS.dll inter/INTER_ALIAS inter/TO_ALIAS inter/TO_ALIAS.dll to/ASCII to/PASS to/PASS.dll
Further requirements for utf-8 -> ebcdic: from/_UTF8 from/_UTF8.dll to/IBM-*
BTW bsdconv fails to build with mingw-w64 headers:
In file included from src/libbsdconv.c:26:0:
src/bsdconv.h:38:2: error: 'INPUT' redeclared as different kind of symbol
In file included from d:\msys\mingw\bin\../lib/gcc/i686-w64-mingw32/4.7.1/../../../../i686-w64-mingw32/include/windows.h:62:0,
from src/bsdconv.h:24,
from src/libbsdconv.c:26:
d:\msys\mingw\bin\../lib/gcc/i686-w64-mingw32/4.7.1/../../../../i686-w64-mingw32/include/winuser.h:2311:5: note: previous declaration of 'INPUT' was here
bsdconv itself doesn't remove \r from \r\n unless you use utf-8:unix:ibm-937, but fopen() newline auto-translation is on by default on win32 unless you open with "rb" or "wb" mode. (Your bsdconv_mkstemp() is fine here, but fopen() is missing the "b" flag.)
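A minimal sketch of what the "b" flag changes, assuming nothing about bsdconv's own file handling: the helpers write_binary and read_binary below are hypothetical names. On win32, text-mode streams translate \n to \r\n on write (and back on read), which would corrupt converted EBCDIC output; opening with "wb" and "rb" keeps the bytes as they are. On POSIX systems the flag is accepted and has no effect.

```c
#include <stdio.h>

/* Hypothetical helper: write len bytes to path with no newline translation.
 * Returns 0 on success, -1 on failure. */
static int write_binary(const char *path, const void *buf, size_t len)
{
    FILE *fp = fopen(path, "wb");  /* "b": binary mode, no \n -> \r\n on win32 */
    if (fp == NULL)
        return -1;
    size_t written = fwrite(buf, 1, len, fp);
    if (fclose(fp) != 0)
        return -1;
    return written == len ? 0 : -1;
}

/* Hypothetical helper: read up to cap bytes from path with no newline
 * translation. Returns the number of bytes read, or -1 on failure. */
static long read_binary(const char *path, void *buf, size_t cap)
{
    FILE *fp = fopen(path, "rb");  /* "b": binary mode, no \r\n -> \n on win32 */
    if (fp == NULL)
        return -1;
    size_t got = fread(buf, 1, cap, fp);
    int err = ferror(fp);
    fclose(fp);
    return err ? -1 : (long)got;
}
```

Writing the four bytes 81 0d 25 82 from the hexdump above with write_binary and reading them back with read_binary should give the same four bytes on any platform.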
Oops, fixed.
Could you try replacing INPUT with _INPUT or SOURCE or something else in src/bsdconv.h: INPUT, and src/libbsdconv.c: ins->phase[0].type=INPUT; on mingw-w64?
With the fopen() fix, it now turns \r\n into \r\r\n.
Yes, replacing INPUT does work here.
Could you try bsdconv utf-8:ibm-937 -i b5.utf8.txt to do an in-place conversion? The \r\n -> \r\r\n change might be caused by redirection (>).
Yes, in-place works.
So we need
setmode(STDIN_FILENO, O_BINARY);
setmode(STDOUT_FILENO, O_BINARY);
when compiling for win32.
committed, could you please test again?
#include <fcntl.h>
is needed for O_BINARY
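Putting the two notes above together, a sketch of the win32-only setup might look like this. It is an illustration, not the committed bsdconv change: set_binary_stdio is a hypothetical name, and it uses the underscore-prefixed _setmode, _fileno and _O_BINARY from the Microsoft C runtime (declared in <io.h>, <stdio.h> and <fcntl.h>), which mingw-w64 also provides, in place of the setmode/STDIN_FILENO/O_BINARY spellings quoted in the thread.

```c
#include <stdio.h>

#ifdef _WIN32
#include <fcntl.h>  /* _O_BINARY */
#include <io.h>     /* _setmode */
#endif

/* Hypothetical helper: put stdin and stdout into binary mode on win32 so that
 * piping or redirecting (">") does not turn \n into \r\n.
 * Returns 0 on success, -1 on failure. It is a no-op on other platforms. */
static int set_binary_stdio(void)
{
#ifdef _WIN32
    if (_setmode(_fileno(stdin), _O_BINARY) == -1)
        return -1;
    if (_setmode(_fileno(stdout), _O_BINARY) == -1)
        return -1;
#endif
    return 0;
}
```

Calling it once at the start of main(), before any reads or writes on the standard streams, should be enough.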
Also, many small tables (for example from/3F) have become huge (from 2,088 bytes to 269,352 bytes),
and so has from/_GB18030 (from 1,929,558 bytes to 18,281,535 bytes).
fixed.
OK, the others are fixed, but to/IBM-9* is huge (for example, to/IBM-937 is 21,443,222 bytes) compared with from/IBM-9* (from/IBM-937 is 898,596 bytes). Is that normal?
another fix is just pushed.
Alright, everything works, and table sizes are sane, great!
It will be very useful to support EBCDIC. In particular, EBCDIC DBCS is supported by Java only in the Windows environment.