OCamlPro / gnucobol

A clone of the SourceForge GnuCOBOL compiler, which compiles COBOL to C.
https://get-superbol.com
GNU Lesser General Public License v3.0

New options to deal with character encodings #77

Closed. ddeclerck closed this 2 months ago.

ddeclerck commented 1 year ago

This PR adds some support for character encodings in different places, using iconv to build conversion tables.

Note to @GitMensch: I still have to add some tests, but you can start reviewing.

It introduces the following new options: -fnative-charset, -ftarget-charset, -fcodeset-ascii, -fcodeset-ebcdic, -fconvert-ascii and -fconvert-ebcdic.

The -fnative-charset and -ftarget-charset options specify the encodings used in the execution environment (native) and in the environment we are trying to simulate (target). When both are set, the collating tables (used for sorting purposes only) are built by combining these two encodings with the "vendor" EBCDIC/ASCII table.

Our typical use case is the following. We simulate a GCOS environment that uses the EBCDIC-500 encoding (our target). We run on a standard Linux PC, where we decided that all the data we process uses the ISO-8859-1 encoding (our native encoding). We have a file with records we'd like to sort according to the EBCDIC collating sequence (the default one on this platform), so that our program sorts in exactly the same order it would if it were run on our GCOS environment. If we use the "vendor" EBCDIC/ASCII table (the one from the GCOS documentation), this fails: that table does not define an EBCDIC-500 <-> ISO-8859-1 translation (in particular, extended characters are just given consecutive values).

That's where -fnative-charset and -ftarget-charset are useful. Here, we set -fnative-charset to ISO-8859-1 and -ftarget-charset to IBM500 (iconv's name for EBCDIC 500). Since both are set, we call iconv to produce an ISO-8859-1 to EBCDIC-500 translation table (cob_colseq_native_ebcdic) and pass that table, instead of the vendor one, to the sort function.
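To make the mechanism concrete, here is a minimal, self-contained sketch of building such a 256-entry table with iconv. This is not the actual libcob code; the function name and the identity fallback for unmapped bytes are assumptions made for illustration.

```c
/* Build with: cc table.c (add -liconv on platforms where iconv is
 * a separate library). */
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>

/* Fill table[i] with the target-charset byte for native byte i. */
static int
build_translation_table (unsigned char table[256],
                         const char *native, const char *target)
{
	iconv_t cd = iconv_open (target, native);	/* (to, from) */
	int i;

	if (cd == (iconv_t) -1)
		return -1;
	for (i = 0; i < 256; i++) {
		unsigned char in = (unsigned char) i;
		unsigned char out = (unsigned char) i;	/* identity fallback */
		char *inp = (char *) &in;
		char *outp = (char *) &out;
		size_t inleft = 1, outleft = 1;

		/* Single-byte charsets: one input byte, one output byte. */
		iconv (cd, &inp, &inleft, &outp, &outleft);
		table[i] = out;
		iconv (cd, NULL, NULL, NULL, NULL);	/* reset conversion state */
	}
	iconv_close (cd);
	return 0;
}

int
main (void)
{
	unsigned char native_to_ebcdic[256];

	if (build_translation_table (native_to_ebcdic, "ISO-8859-1", "IBM500"))
		return EXIT_FAILURE;
	/* 'A' is 0x41 in ISO-8859-1 and 0xC1 in EBCDIC-500. */
	printf ("'A' -> 0x%02X\n", native_to_ebcdic['A']);
	return EXIT_SUCCESS;
}
```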

Now, what if the SORT statement contained a COLLATING SEQUENCE IS ASCII clause? Here, we would want to sort as our GCOS system would, hence using the vendor table. But iconv does not know about the specific ASCII encoding used by GCOS. What we want is thus to convert from ISO-8859-1 to EBCDIC-500 and then use the vendor table to translate that EBCDIC to the GCOS-specific ASCII. To avoid performing two conversions at run time, we simply build a new conversion table (cob_colseq_native_ascii) that combines the two steps.
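The combination itself is just a table composition; a hypothetical sketch (the names are illustrative, not the actual libcob identifiers):

```c
/* Fold the iconv-built native->EBCDIC table and the vendor
 * EBCDIC->ASCII table into a single lookup table, so sorting still
 * needs only one translation per byte. */
static void
compose_tables (unsigned char out[256],
                const unsigned char native_to_ebcdic[256],
                const unsigned char vendor_ebcdic_to_ascii[256])
{
	int i;

	for (i = 0; i < 256; i++)
		out[i] = vendor_ebcdic_to_ascii[native_to_ebcdic[i]];
}
```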

The implementation is generic enough to work with any combination of native/target encodings among the EBCDIC and ASCII families; in particular, it uses the vendor table in one direction or the other depending on the actual target encoding: if the target encoding is EBCDIC, it adds a vendor EBCDIC -> ASCII translation when an ASCII collating sequence is requested, and symmetrically, if the target encoding is ASCII, it adds a vendor ASCII -> EBCDIC translation when an EBCDIC collating sequence is requested.
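The direction logic amounts to the following sketch (the enum and function names are assumptions for illustration):

```c
/* Which family a charset belongs to. */
enum cs_family { CS_ASCII, CS_EBCDIC };

/* Build the collating table for a requested ASCII or EBCDIC
 * collating sequence: start from the iconv-built native->target
 * table, then chain the vendor table whenever the requested family
 * differs from the target's family. */
static void
build_collating_table (unsigned char out[256],
                       const unsigned char native_to_target[256],
                       const unsigned char vendor_e2a[256],
                       const unsigned char vendor_a2e[256],
                       enum cs_family target, enum cs_family requested)
{
	int i;

	for (i = 0; i < 256; i++) {
		unsigned char c = native_to_target[i];

		if (target == CS_EBCDIC && requested == CS_ASCII)
			c = vendor_e2a[c];	/* EBCDIC target, ASCII sequence */
		else if (target == CS_ASCII && requested == CS_EBCDIC)
			c = vendor_a2e[c];	/* ASCII target, EBCDIC sequence */
		out[i] = c;
	}
}
```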

That's it for the main feature.

Now, since I had started digging into encodings, I thought it could also be useful to be able to specify the ASCII and EBCDIC encodings used in the CODE-SET and CONVERTING/REPLACING clauses. That's the purpose of the -fcodeset-ascii, -fcodeset-ebcdic, -fconvert-ascii and -fconvert-ebcdic options.

The -fcodeset-ascii and -fcodeset-ebcdic options work together with the -fnative-charset option. When they are specified and a CODE-SET clause is active on a file, the necessary conversions are performed by iconv using the specified encodings instead of the vendor table. For instance, reading from a file that has CODE-SET IS EBCDIC performs a translation from the encoding specified by the -fcodeset-ebcdic option to the encoding specified by the -fnative-charset option. This can be convenient when migrating an application from a mainframe while keeping the data files in an EBCDIC encoding.
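At run time, the effect on a READ boils down to a per-byte translation of the record through a table built as above; a hypothetical sketch:

```c
#include <stddef.h>

/* Translate a freshly read record from the file's CODE-SET encoding
 * to the native encoding, one byte at a time.  The table would be
 * built with iconv from the -fcodeset-ebcdic encoding to the
 * -fnative-charset one. */
static void
translate_record (unsigned char *rec, size_t len,
                  const unsigned char codeset_to_native[256])
{
	size_t i;

	for (i = 0; i < len; i++)
		rec[i] = codeset_to_native[rec[i]];
}
```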

As for the -fconvert-ascii and -fconvert-ebcdic options, they let you specify the conversions to perform in INSPECT CONVERTING / REPLACING and TRANSFORM CONVERTING.

codecov-commenter commented 1 year ago

Codecov Report

Merging #77 (41e67d8) into gcos4gnucobol-3.x (516b813) will decrease coverage by 0.20%. The diff coverage is 15.04%.

@@                  Coverage Diff                  @@
##           gcos4gnucobol-3.x      #77      +/-   ##
=====================================================
- Coverage              65.27%   65.07%   -0.20%     
=====================================================
  Files                     31       31              
  Lines                  56425    56623     +198     
  Branches               14723    14779      +56     
=====================================================
+ Hits                   36830    36846      +16     
- Misses                 13807    13973     +166     
- Partials                5788     5804      +16     
| Impacted Files | Coverage Δ |
| --- | --- |
| cobc/cobc.c | 70.65% <0.00%> (-0.36%) ↓ |
| cobc/parser.y | 68.89% <ø> (-0.08%) ↓ |
| libcob/cconv.c | 37.50% <0.00%> (-38.98%) ↓ |
| cobc/codegen.c | 73.81% <18.79%> (-1.00%) ↓ |
| cobc/flag.def | 100.00% <100.00%> (ø) |


ddeclerck commented 1 year ago

> The question is: Can any of the above be at least improved without adding the iconv dependency?

Well, if we restrict ourselves to EBCDIC <-> 8-bit ASCII translations (mostly ISO-8859-X, and possibly UTF-X / UCS-X), I guess it would be acceptable to just copy the tables we need from iconv (the lib itself is LGPL, like the GnuCOBOL runtime). Otherwise, if we'd like to support all kinds of encodings, we would end up just reimplementing iconv :p.

Anyway, the problem right now is that our client only needs these tables to get SORT working as on his mainframe (I added support for CODE-SET and CONVERTING as a "bonus"), so we can hardly spend extra time implementing all the scenarios (if we had unlimited funding, I'd be happy to investigate all of them). Maybe we can think of a way to implement this quickly without impairing our ability to implement the remaining scenarios?

GitMensch commented 1 year ago

Then what about only adding the necessary tables, hard-wired, for now? This could get into 3.2 final too, if it is quick and clean, and provide others with a hint on how to add their own translations.

ddeclerck commented 1 year ago

> Then what about only adding the necessary tables, hard-wired, for now? This could get into 3.2 final too, if it is quick and clean, and provide others with a hint on how to add their own translations.

So you mean adding another translation table using the current -febcdic-table option?

GitMensch commented 1 year ago

Yes, that's what I can offer for 3.2 inclusion. Please add a comment on how you created that table.

... thinking again: even better would be to allow -febcdic-table to read in a file (checking just the length would be fine) and store it in the module.
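A sketch of what such a loader could look like; the function name and the exact validation are assumptions, not the final implementation:

```c
#include <stdio.h>

/* Load a raw 256-byte translation table from the file passed to
 * -febcdic-table, accepting it only if it is exactly 256 bytes. */
static int
load_table_file (const char *path, unsigned char table[256])
{
	FILE   *f = fopen (path, "rb");
	size_t  n;
	int     too_long;

	if (!f)
		return -1;
	n = fread (table, 1, 256, f);
	too_long = (fgetc (f) != EOF);	/* any byte left: file is too long */
	fclose (f);
	return (n == 256 && !too_long) ? 0 : -1;
}
```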

ddeclerck commented 1 year ago

> Yes, that's what I can offer for 3.2 inclusion. Please add a comment on how you created that table.
>
> ... thinking again: even better would be to allow -febcdic-table to read in a file (checking just the length would be fine) and store it in the module.

That would probably solve the problem in the easiest possible way. I'll do that in another PR.

GitMensch commented 1 year ago

Given -fdefault-colseq and #78, where does this lead us concerning the issues above?

ddeclerck commented 1 year ago

Well, #78 and -fdefault-colseq do solve our current issue, yet the points raised here remain valid. Let's keep this here as a draft for reference for now.

GitMensch commented 2 months ago

@ddeclerck Can you please rebase this and then review what's missing feature-wise so far?

Ideally you could add / adjust some testcases, allowing everyone to grasp more easily what the new options are to be used for.

ddeclerck commented 2 months ago

Hi @GitMensch,

Actually, this PR is rather obsolete; the new option -fdefault-colseq (#72, #101) and the modifications allowing custom translation tables in -febcdic-table (#78) are just as useful, simpler to use, and easier to implement.

GitMensch commented 2 months ago

It's your PR, so if all things from the PR's initial post are covered otherwise, then feel free to close this (maybe adding a new label, obsolete or similar, so it's easier to distinguish from the closed "merged upstream" ones).

One thing that I've seen, which was the reason to check the PRs: I'm not sure if -fdefault-colseq is applied correctly for SPECIAL-NAMES. ALPHABET nat IS NATIVE. It would be helpful if you could check whether that's the case and whether we have a testcase for that.

ddeclerck commented 2 months ago

> One thing that I've seen, which was the reason to check the PRs: I'm not sure if -fdefault-colseq is applied correctly for SPECIAL-NAMES. ALPHABET nat IS NATIVE. It would be helpful if you could check whether that's the case and whether we have a testcase for that.

To my understanding, -fdefault-colseq acts like an implicit PROGRAM COLLATING SEQUENCE clause; so should it really affect the NATIVE alphabet definition?

GitMensch commented 2 months ago

Hm, MF has a NATIVE directive which changes the default collating sequence, but that doesn't change the default alphabet either. To change that you'd use the CHARSET directive (which also changes DEFAULTBYTE, SIGN, and NATIVE).

I think we don't have an option to effectively change the native alphabet, do we?

ddeclerck commented 2 months ago

> Hm, MF has a NATIVE directive which changes the default collating sequence, but that doesn't change the default alphabet either. To change that you'd use the CHARSET directive (which also changes DEFAULTBYTE, SIGN, and NATIVE).
>
> I think we don't have an option to effectively change the native alphabet, do we?

Not really, though that might have some use (we talked to someone who claimed he needed the internal, in-memory data representation to be in EBCDIC while running on an ASCII machine).