buganini / bsdconv

A simple but powerful DSL for charset/encoding conversion and transformation, pure C implementation with no extra dependencies
https://bsdconv.io/bsdconv/
BSD 2-Clause "Simplified" License

FR: Supporting EBCDIC (SBCS/DBCS) #7

Closed roytam1 closed 11 years ago

roytam1 commented 11 years ago

It would be very useful to support EBCDIC, especially EBCDIC DBCS, which Java supports only in the Windows environment.

buganini commented 11 years ago

Do you know where to find mapping tables?

roytam1 commented 11 years ago

http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-930_P120-1999.ucm?txt
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-933_P110-1995.ucm?txt
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-935_P110-1999.ucm?txt
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-937_P110-1999.ucm?txt
http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-939_P120-1999.ucm?txt

EBCDIC DBCS encodings use SO for changing state to DBCS mode and SI for changing state to SBCS mode.

buganini commented 11 years ago

Is there a "standard" SBCS mapping?

And where can I find test data?

roytam1 commented 11 years ago

The SBCS part normally uses ibm-37: http://www.opensource.apple.com/source/ICU/ICU-491.11.1/icuSources/data/mappings/ibm-37_P100-1995.ucm?txt (comment 2 is updated)

I made some test data here: http://roy.orz.hm/soft/ebcdic-test/ (the *.utf8.txt files are the UTF-8 output)

buganini commented 11 years ago

Should all control characters be converted and preserved, except for SI/SO?

roytam1 commented 11 years ago

All characters should be converted, unless the codepage is DBCS and SO/SI appear as a pair, in which case the SO/SI pair itself is removed.
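The SO/SI handling described here can be sketched in C as follows. This is a minimal illustration under stated assumptions, not bsdconv's actual decoder: the names `split_units` and `struct unit` are hypothetical, SO is assumed to be byte 0x0E and SI byte 0x0F, and every SO/SI byte is dropped rather than only paired ones. It only finds the boundaries of one-byte and two-byte units; mapping each unit to Unicode would still require the per-codepage tables.

```c
#include <assert.h>
#include <stddef.h>

#define EBCDIC_SO 0x0E /* shift out: switch to DBCS (two-byte) mode */
#define EBCDIC_SI 0x0F /* shift in: switch back to SBCS (one-byte) mode */

/* Hypothetical record: one code unit located inside the input buffer. */
struct unit {
    size_t off; /* offset of the unit's first byte */
    size_t len; /* 1 in SBCS mode, 2 in DBCS mode */
};

/* Splits a stateful EBCDIC stream into units, skipping SO/SI bytes.
 * Returns the number of units written to out, or -1 if the input ends
 * in the middle of a two-byte unit or out has too few slots. */
static long split_units(const unsigned char *in, size_t n,
                        struct unit *out, size_t cap)
{
    int dbcs = 0; /* streams start in SBCS mode */
    size_t i = 0;
    long count = 0;

    assert(in != NULL || n == 0);
    while (i < n) {
        if (in[i] == EBCDIC_SO) { dbcs = 1; i++; continue; }
        if (in[i] == EBCDIC_SI) { dbcs = 0; i++; continue; }

        size_t len = dbcs ? 2 : 1;
        if (n - i < len || (size_t)count >= cap)
            return -1;
        out[count].off = i;
        out[count].len = len;
        count++;
        i += len;
    }
    return count;
}
```

For the byte sequence `81 0E 4C 41 4C 42 0F 82` this reports four units: one single-byte unit, two double-byte units, and another single-byte unit, with the SO and SI bytes themselves omitted.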

buganini commented 11 years ago

IBM-930 is not a superset of IBM-37. Does every MBCS have its own SBCS definition?

buganini commented 11 years ago

IBM-930 has a weird secondary charmap. Do you know why, or what it is?

roytam1 commented 11 years ago

Yes, it varies. IBM-939 should have the same SBCS encoding as IBM-37.

http://en.wikipedia.org/wiki/EBCDIC_930

buganini commented 11 years ago

decoders are done, are encoders wanted?

roytam1 commented 11 years ago

Yes, it would be nice to have. :-)

buganini commented 11 years ago

done

buganini commented 11 years ago

Please test it. If everything is fine, I'll bump a new release.

roytam1 commented 11 years ago

It looks OK from my preliminary testing, but I found that bsdconv removes \r from \r\n. Command line:

bsdconv utf-8:ibm-937 b5.utf8.txt > b5.ibm937.txt

roytam1 commented 11 years ago

A bit off topic, but is it possible to keep only a minimal set of files for a limited/specified function? For example, if I only want the UTF-8 <--> EBCDIC function, which files should I keep? Or could the files be embedded completely into libbsdconv.dll?

buganini commented 11 years ago

bsdconv doesn't remove \r from \r\n, unless you use utf-8:unix:ibm-937

$ perl -e 'print "a\r\nb"' | bsdconv utf-8:ibm-937|hexdump -C
00000000  81 0d 25 82                                       |..%.|
00000004

Might it be done by shell redirection? Could you give me b5.ibm937.txt to verify the problem?

Bsdconv itself uses:

from/ASCII
from/PASS
from/PASS.dll
inter/FROM_ALIAS
inter/FROM_ALIAS.dll
inter/INTER_ALIAS
inter/TO_ALIAS
inter/TO_ALIAS.dll
to/ASCII
to/PASS
to/PASS.dll

Further requirements for utf-8 -> ebcdic:

from/_UTF8
from/_UTF8.dll
to/IBM-*

roytam1 commented 11 years ago

BTW bsdconv fails to build with mingw-w64 headers:

In file included from src/libbsdconv.c:26:0:
src/bsdconv.h:38:2: error: 'INPUT' redeclared as different kind of symbol
In file included from d:\msys\mingw\bin\../lib/gcc/i686-w64-mingw32/4.7.1/../../../../i686-w64-mingw32/include/windows.h:62:0,
                 from src/bsdconv.h:24,
                 from src/libbsdconv.c:26:
d:\msys\mingw\bin\../lib/gcc/i686-w64-mingw32/4.7.1/../../../../i686-w64-mingw32/include/winuser.h:2311:5: note: previous declaration of 'INPUT' was here

roytam1 commented 11 years ago

> bsdconv doesn't remove \r from \r\n, unless you use utf-8:unix:ibm-937

But on win32, fopen() performs text-mode newline translation by default unless the file is opened in "rb"/"wb" mode. (Your bsdconv_mkstemp() is fine here, but the fopen() call is missing the "b" flag.)

buganini commented 11 years ago

Oops, fixed.

Could you try replacing INPUT with _INPUT, SOURCE, or something else on mingw-w64? It appears in two places:

src/bsdconv.h: INPUT,
src/libbsdconv.c: ins->phase[0].type=INPUT;

roytam1 commented 11 years ago

> Oops, fixed.

But it now turns \r\n into \r\r\n.

> Could you try to replace INPUT with _INPUT or SOURCE or something else in

Yes, it does work here.
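The build log above shows that winuser.h already declares `INPUT`, so a bare enumerator with that name cannot coexist with `<windows.h>`. One common way to avoid this class of clash is to prefix the enumerators. The sketch below uses hypothetical names (`conv_phase_type`, `CONV_PHASE_*`, `init_first_phase`) and is not the rename that was actually committed to bsdconv.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names: the prefix keeps the enumerators out of the way of
 * identifiers such as INPUT that <windows.h> brings in via winuser.h. */
enum conv_phase_type {
    CONV_PHASE_INPUT,
    CONV_PHASE_FROM,
    CONV_PHASE_INTER,
    CONV_PHASE_TO
};

struct conv_phase {
    enum conv_phase_type type;
};

/* Marks the first phase as the input phase, mirroring the assignment
 * ins->phase[0].type=INPUT quoted in the thread. */
static void init_first_phase(struct conv_phase *phase)
{
    assert(phase != NULL);
    phase[0].type = CONV_PHASE_INPUT;
}
```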

buganini commented 11 years ago

Could you try bsdconv utf-8:ibm-937 -i b5.utf8.txt to do an in-place conversion? The \r\n -> \r\r\n change might be caused by the redirection (>).
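To illustrate how a \r\n can become \r\r\n, the following C sketch imitates what a text-mode output stream does: each LF byte is written as CR LF, so data that already contains CR LF gains a second CR. The function name `text_mode_write` is hypothetical, and this only models the general symptom; it does not claim to reproduce the exact code path in bsdconv.

```c
#include <assert.h>
#include <stddef.h>

/* Imitates text-mode output translation: every LF byte is emitted as CR LF.
 * Returns the number of bytes produced, or 0 if out is too small
 * (0 is also the result for empty input). */
static size_t text_mode_write(const char *in, size_t n, char *out, size_t cap)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (in[i] == '\n') {
            if (o == cap)
                return 0;
            out[o++] = '\r'; /* the extra CR inserted by text mode */
        }
        if (o == cap)
            return 0;
        out[o++] = in[i];
    }
    return o;
}
```

Feeding the four bytes `a \r \n b` through this function yields the five bytes `a \r \r \n b`, which matches the doubling reported above. Writing the same bytes through a binary-mode stream would leave them unchanged.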

roytam1 commented 11 years ago

yes, in-place works.

so we need

   setmode(STDIN_FILENO, O_BINARY);
   setmode(STDOUT_FILENO, O_BINARY);

when compiling for win32.
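A guarded version of this idea might look like the sketch below. It assumes the underscore-prefixed CRT names `_setmode`, `_fileno`, and `_O_BINARY` (declared in `<io.h>` and `<fcntl.h>`) instead of the `setmode`/`STDIN_FILENO` spelling above, and the wrapper name `set_binary_stdio` is hypothetical. On other platforms the function does nothing, since they do not translate newlines.

```c
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h> /* _O_BINARY */
#include <io.h>    /* _setmode, _fileno */
#endif

/* Switches stdin and stdout to binary mode on Windows so that CR and LF
 * bytes pass through untouched. Returns 0 on success, -1 on failure.
 * On non-Windows platforms this is a no-op that returns 0. */
static int set_binary_stdio(void)
{
#ifdef _WIN32
    if (_setmode(_fileno(stdin), _O_BINARY) == -1)
        return -1;
    if (_setmode(_fileno(stdout), _O_BINARY) == -1)
        return -1;
#endif
    return 0;
}
```

Calling this once at the start of `main`, before any input is read or output is written, keeps piped and redirected data byte-exact.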

buganini commented 11 years ago

committed, could you please test again?

roytam1 commented 11 years ago

#include <fcntl.h>

is needed for O_BINARY

roytam1 commented 11 years ago

Also, many small tables (for example from/3F) have become huge (from 2,088 bytes to 269,352 bytes).

roytam1 commented 11 years ago

from/_GB18030 has grown as well (from 1,929,558 bytes to 18,281,535 bytes).

buganini commented 11 years ago

fixed.

roytam1 commented 11 years ago

OK, the others are fixed. But to/IBM-9* is huge compared with from/IBM-9*: for example, to/IBM-937 is 21,443,222 bytes, while from/IBM-937 is 898,596 bytes. Is that normal?

buganini commented 11 years ago

another fix is just pushed.

roytam1 commented 11 years ago

> another fix is just pushed.

Alright, everything works, and the table sizes are sane. Great!