OCamlPro / gnucobol

A clone of the sourceforge GnuCOBOL compiler from COBOL to C.
https://get-superbol.com
GNU Lesser General Public License v3.0
20 stars 23 forks source link

"FREE" source format means X/Open free-form under MF #68

Closed nberth closed 1 year ago

nberth commented 1 year ago

I've recently come across some COBOL source files apparently targeting MicroFocus, where comments could be seen as lines starting with a * after a $ SET SOURCEFORMAT"FREE" directive. This appears to be backed up by https://www.microfocus.com/documentation/visual-cobol/vc50pu7/VS2019/HRLHLHINTR01U008.html . This PR is a suggestion to handle *|/-starting lines as comments when the -fmfcomment is given, even in FREE reference source format.

The space between $ and SET in the directive is neither documented not supported by GnuCOBOL. I'm not sure whether this is actually allowed.

GitMensch commented 1 year ago

The main issue is that this isn't COBOL free-form reference-format but XOPEN "free form reference-format".

The X/Open COBOL definition for free form reference format is modelled after the CODASYL COBOL reference format for free form source. [...]

  • When appearing in the first character position of a line, the single-byte symbols ‘‘∗’’ and ‘‘/’’ are always interpreted as indicating comment lines.
  • When appearing in the first two character positions of a line, the single-byte symbol pair ‘‘D ’’ (D followed by space) is always interpreted as a debugging line.

-fmf-comment adjusts fixed-form only (an alternative would be a CB_FORMAT_MF_FIXED which implies this). CB_FORMAT_XOPEN_FFF does handle "*" and "/" and "D " correctly without additional flags, doesn't it? If not then this should definitely be fixed.

The space between $ and SET in the directive is neither documented not supported by GnuCOBOL. I'm not sure whether this is actually allowed.

I've rechecked that: Yes, this is allowed with MF, there should be a bug report on this in GnuCOBOL and if/when possible a fix.

nberth commented 1 year ago

-fmf-comment adjusts fixed-form only (an alternative would be a CB_FORMAT_MF_FIXED which implies this). CB_FORMAT_XOPEN_FFF does handle "*" and "/" and "D " correctly without additional flags, doesn't it? If not then this should definitely be fixed.

Ok I see. The files I've seen always start with the SOURCEFORMAT"FREE" directive, so resorting to X/Open, or adding other formats with a different name, might not solve the issue completely. The alternative is to change the interpretation of FREE using as specific flag… not sure this is very clean though. In addition, as the suggested change did not break any test, that might indicate the current tests for COBOL free-form reference-format might not be incomplete (e.g., a * at the beginning of a line that need to be interpreted as an arithmetic operation is probably not tested).

The space-after-$ issue seems rather easy to fix at first glance; I'll see to prepare a new PR for this. Does this work on MF for any directive or just SET?

GitMensch commented 1 year ago

The space-after-$ issue seems rather easy to fix at first glance; I'll see to prepare a new PR for this. Does this work on MF for any directive or just SET?

Nice, works on all I've tested, which includes conditional compilation and SET - just go with all.

GitMensch commented 1 year ago

Ok I see. The files I've seen always start with the SOURCEFORMAT"FREE" directive, so resorting to X/Open, or adding other formats with a different name, might not solve the issue completely.

Good point, that means that reading the SOURCEFORMAT directive with -std=mf (check the dialect flag) should set the CB_FORMAT_XOPEN_FFF, similar - if a fixed format for MF is added, which seems reasonable - this should also be selected automatically for $ SET SOURCEFORMAT FREE (actually set or implied). ... So that means that -fixed / -free should also be dialect aware.

nberth commented 1 year ago

Good point, that means that reading the SOURCEFORMAT directive with -std=mf (check the dialect flag) should set the CB_FORMAT_XOPEN_FFF, similar - if a fixed format for MF is added, which seems reasonable - this should also be selected automatically for $ SET SOURCEFORMAT FREE (actually set or implied). ... So that means that -fixed / -free should also be dialect aware.

Poking about the docs I think we need another format to handle this, like a CB_FORMAT_MF_FREE, that is very similar to X/Open free-form format but allows lines of 256 single-byte characters (instead of 80 "character positions" for X/Open FFF). As per the MF doc on "FREE" format: "There is no fixed right margin, though for practical purposes this implementation restricts the maximum length of a source line to 256 bytes of source when represented in UTF8 encoding."

Then, to handle $ SET SOURCEFORMAT"FREE" properly, maybe it's a matter of defining dialect-specific dictionaries of source format names, with a default and an MF-specific one that remaps "FREE". If this dictionary can then be selected using a dialect option like source-format-dict=[mf|default], we'd need a way to ensure that -fformat=free -fsource-format-dict=mf and -fsource-format-dict=mf -fformat=free behave the same. I'm not sure whether this is easy to do with the current way configuration options are handled… A more naive but simpler approach would be to treat $ SET SOURCEFORMAT"FREE", which is an MF extension, differently than the >>SOURCE equivalent (with possibly for the MF one a configurable note or warning about the name mismatch).

nberth commented 1 year ago

Nice, works on all I've tested, which includes conditional compilation and SET - just go with all.

One last detail: is any number of space allowed, or a single one?

GitMensch commented 1 year ago

Nice, works on all I've tested, which includes conditional compilation and SET - just go with all.

One last detail: is any number of space allowed, or a single one?

Rechecked: both "any number of space" and/or any number of tabs.

GitMensch commented 1 year ago

Poking about the docs I think we need another format to handle this, like a CB_FORMAT_MF_FREE, that is very similar to X/Open free-form format but allows lines of 256 single-byte characters (instead of 80 "character positions" for X/Open FFF). As per the MF doc on "FREE" format: "There is no fixed right margin, though for practical purposes this implementation restricts the maximum length of a source line to 256 bytes of source when represented in UTF8 encoding."

As we don't explicit handle "character positions" we should go with "possible bytes per character" - and this would mean 1-4 for UTF8 so I'd say 256 "practical limit" is fine.

to handle $ SET SOURCEFORMAT"FREE" properly, maybe it's a matter of defining dialect-specific dictionaries of source format names

Nah, just check for cb_std_def (CB_STD_MF) when parsing that, and possibly add to cb_std_def as needed.

nberth commented 1 year ago

As we don't explicit handle "character positions" we should go with "possible bytes per character" - and this would mean 1-4 for UTF8 so I'd say 256 "practical limit" is fine.

The current implementation restricts X/Open free-form format to 80 columns. So I was probably wrong when setting that limit, and should have selected 320 instead as this is a count of bytes and not characters.

That's leaving us with the only minor difference of MF explicitly allowing "d " and "D " as debug indicators, and X/Open only explicitly allowing "D " (but "d " might just be not documented).

nberth commented 1 year ago

Current changes also start to remove some "free-from" formats from those eligible for Area A enforcement: for some reason I included some terminal-like formats in this set originally. I'm mostly sure about X/Open free-format that mentions it has no Area A. I'll need to check the others (CRT, TERMINAL, and XCARD).

GitMensch commented 1 year ago

... and I've just recognized that we don't handle MFCOMMENT correctly either - if active it does not only set that in fixed format an asterisk at column 1 means "comment line" - it actually means "comment line with listing suppression".

... not sure if you get to that (all in cobc/cobc.c) part, but wanted to drop it as "related". Could be done here or with a follow-up PR.

GitMensch commented 1 year ago

Thank you. Ready for svn if you don't want to add $SET DIALECT before :-)

codecov-commenter commented 1 year ago

Codecov Report

Merging #68 (9b6bbf1) into gcos4gnucobol-3.x (173790f) will decrease coverage by 0.00%. The diff coverage is 86.66%.

@@                  Coverage Diff                  @@
##           gcos4gnucobol-3.x      #68      +/-   ##
=====================================================
- Coverage              65.08%   65.08%   -0.01%     
=====================================================
  Files                     34       34              
  Lines                  55907    55910       +3     
  Branches               14678    14683       +5     
=====================================================
+ Hits                   36386    36387       +1     
  Misses                 13661    13661              
- Partials                5860     5862       +2     
Impacted Files Coverage Δ
cobc/flag.def 100.00% <ø> (ø)
cobc/cobc.c 70.11% <85.71%> (+<0.01%) :arrow_up:
cobc/pplex.l 71.33% <87.50%> (-0.10%) :arrow_down:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

nberth commented 1 year ago

Thank you. Ready for svn if you don't want to add $SET DIALECT before :-)

Just at attempt at better honoring MFCOMMENT w.r.t listings. Please check the tests, as I'm not sure this is indeed the expected behavior.

nberth commented 1 year ago

I think I'll leave auto-enabling mf/acucomment for later. Not fan of putting an ad-hoc piece of code in cb_load_std, although it's tempting as it is indeed the simplest approach…

nberth commented 1 year ago

Now upstream :tada: