Incorrect encoding for alphanumeric literals using hexadecimal notation

fm-117 commented 4 months ago

What is the problem?

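The scanner uses the MultilineScanState.EncodingForAlphanumericLiterals property to get the string value of alphanumeric literals written in hexadecimal notation. However, this property gets its value from the encoding of the source file, which is a different notion: the source file encoding describes how the file is stored, whereas the bytes of a hex literal must be interpreted in the code page the program targets.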

Here are the IBM specs for alphanumeric literals written in hex:

Hexadecimal digits are characters in the range '0' to '9', 'a' to 'f', and 'A' to 'F', inclusive. Two hexadecimal digits represent one character in a single-byte character set (EBCDIC or ASCII). Four hexadecimal digits represent one character in a DBCS character set. A string of EBCDIC DBCS characters represented in hexadecimal notation must be preceded by the hexadecimal representation of a shift-out control character (X'0E') and followed by the hexadecimal representation of a shift-in control character (X'0F'). An even number of hexadecimal digits must be specified. The maximum length of a hexadecimal literal is 320 hexadecimal digits.

The continuation rules are the same as those for any alphanumeric literal. The opening delimiter (X" or X') cannot be split across lines.

The DBCS compiler option has no effect on the processing of hexadecimal notation of alphanumeric literals.
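
To make the rules quoted above concrete, here is a minimal Python sketch of decoding the body of a hex literal for a single-byte character set. It is only an illustration, not the TypeCobol scanner (which is written in C#): the function name `decode_hex_literal` and its parameters are hypothetical, DBCS shift-out/shift-in handling is left out, and `cp1140` is simply one of the EBCDIC codecs that ship with Python.

```python
import re

MAX_HEX_DIGITS = 320  # limit stated in the IBM text quoted above


def decode_hex_literal(digits: str, encoding: str = "cp1140") -> str:
    """Decode the digits found between X' and ' using the given code page.

    Hypothetical helper: handles single-byte character sets only.
    """
    if not re.fullmatch(r"[0-9A-Fa-f]*", digits):
        raise ValueError("only hexadecimal digits are allowed")
    if len(digits) % 2 != 0:
        raise ValueError("an even number of hexadecimal digits is required")
    if len(digits) > MAX_HEX_DIGITS:
        raise ValueError("hexadecimal literal is too long")
    # Two hex digits give one byte, and one byte gives one character (SBCS).
    return bytes.fromhex(digits).decode(encoding)


if __name__ == "__main__":
    # The same bytes give different text depending on the code page used.
    print(decode_hex_literal("C1C2C3", "cp1140"))   # EBCDIC interpretation
    print(decode_hex_literal("C1C2C3", "latin-1"))  # ASCII-family interpretation
```

Decoding X'C1C2C3' as EBCDIC yields "ABC", while decoding the same bytes with an ASCII-family encoding yields accented letters, which is the kind of wrong text value this issue is about.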

How to fix?

Clients should be able to specify both the encoding of their sources and the encoding used to read hex literals, so we need a new option.
A sensible default value should be used:
- EBCDIC 1147 is the default in our company but it may not be the most widely used encoding
- Whichever default value is used, this will result in a breaking change
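
As a rough sketch of the proposal, the two encodings could be carried as separate options. The names below (`ScanOptions`, `source_encoding`, `hex_literal_encoding`, `literal_value`) are hypothetical and do not come from TypeCobol. Python is used only for illustration, and since it does not ship a codec for IBM-1147, the example uses `cp1140` and `cp037` instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanOptions:
    """Hypothetical options object keeping the two notions apart."""
    source_encoding: str = "utf-8"        # how the source file is stored
    hex_literal_encoding: str = "cp1140"  # how X'..' bytes are interpreted


def literal_value(hex_digits: str, options: ScanOptions) -> str:
    # Only the literal encoding matters here; the source encoding is ignored.
    return bytes.fromhex(hex_digits).decode(options.hex_literal_encoding)


if __name__ == "__main__":
    default = ScanOptions()
    legacy = ScanOptions(hex_literal_encoding="cp037")
    # Byte 0x9F is one of the code points where these two code pages differ.
    print(repr(literal_value("9F", default)))
    print(repr(literal_value("9F", legacy)))
```

Byte 0x9F decodes to the euro sign under `cp1140` and to the currency sign under `cp037`, so even closely related EBCDIC code pages can produce different literal values. This is why any choice of default is a breaking change for some users.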

fm-117 commented 4 months ago

See DISPLAY.CodeElements.txt for an example of wrong text value: https://github.com/TypeCobolTeam/TypeCobol/blob/f568ebe67766c1860646407367c492a3a886b827/TypeCobol.Test/Parser/CodeElements/DISPLAY.CodeElements.txt#L96-L97

fm-117 commented 4 months ago

For now:

- use IBM-1140 as the default for the Debug and Release configurations
- use IBM-1147 as the default for EI_Debug and EI_Release, which are specific to our own internal use

fm-117 commented 4 months ago

Partially fixed by #2633.

We still need:

- to create an option to change the value externally (for LS and CLI users)
- to account for a CODEPAGE option declared directly in the source: the scanner should read the value and dynamically change the encoding of literals for the rest of the document
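
The second point could look roughly like the sketch below. It is a simplified Python illustration under several assumptions, not the planned implementation: the directive is only recognized on lines starting with CBL or PROCESS, the CCSID-to-codec table is a hypothetical stand-in limited to codecs Python ships, and hex literals are found with a naive regular expression that ignores continuation lines.

```python
import re

# Hypothetical mapping from CCSID to a Python codec name (illustration only).
CCSID_TO_CODEC = {37: "cp037", 500: "cp500", 1140: "cp1140"}

CODEPAGE_RE = re.compile(r"\b(?:CODEPAGE|CP)\s*\(\s*(\d+)\s*\)", re.IGNORECASE)
HEX_LITERAL_RE = re.compile(r"""\bX(['"])([0-9A-Fa-f]*)\1""", re.IGNORECASE)


def scan_hex_literals(lines, default_codec="cp1140"):
    """Return decoded hex literal values, switching codec on CODEPAGE."""
    codec = default_codec
    values = []
    for line in lines:
        if line.strip().upper().startswith(("CBL ", "PROCESS ")):
            match = CODEPAGE_RE.search(line)
            if match:
                # Unknown CCSIDs keep the current codec in this sketch.
                codec = CCSID_TO_CODEC.get(int(match.group(1)), codec)
            continue
        for literal in HEX_LITERAL_RE.finditer(line):
            # Literals after the directive use the newly selected codec.
            values.append(bytes.fromhex(literal.group(2)).decode(codec))
    return values


if __name__ == "__main__":
    program = [
        "CBL CODEPAGE(37)",
        "       DISPLAY X'C1C2C3'",
    ]
    print(scan_hex_literals(program))
```

The design point is that the encoding becomes part of the scanner state: once the directive is seen, every later literal in the document is decoded with the new code page, while the externally supplied option only provides the starting value.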

TypeCobolTeam/TypeCobol issue #2632