Missing carriage return in latin_list_1.3.txt

johnp commented 6 months ago

Hi,

in latin_list_1.3.txt the carriage return line is missing the carriage return (0D byte) at the end of the line:

https://github.com/String-Latin/DIN-91379-Characters-and-Sequences/blob/333031c992cc97b518f32e027fcce3c652e334c3/latin_list_1.3.txt#L737

latin_list_1.2.txt correctly contains the 0D byte at that location:

https://github.com/String-Latin/DIN-91379-Characters-and-Sequences/blob/333031c992cc97b518f32e027fcce3c652e334c3/latin_list_1.2.txt#L734

Maybe git's autocrlf feature dropped it (autocrlf should maybe be disabled for the .txt files in this repo via .gitattributes?).

vk-github18 commented 6 months ago

The lines look different because latin_list_1.2.txt has MSDOS line endings, whereas latin_list_1.3.txt has Unix line endings. The character 0D is not missing.

johnp commented 6 months ago

I did indeed miss the difference in line endings. It still appears to me that the 0d byte is entirely missing in latin_list_1.3.txt and has apparently been converted to 0d0a in latin_list_1.2.txt. I've verified this by downloading the files directly to avoid any git autocrlf conversion on my side.

curl -s https://raw.githubusercontent.com/String-Latin/DIN-91379-Characters-and-Sequences/333031c992cc97b518f32e027fcce3c652e334c3/latin_list_1.2.txt | head -n 736 | tail -n 4 | xxd

00000000: 0d0a 626e 6c6e 6f74 3b20 6368 6172 3b20  ..bnlnot; char;
00000010: 3030 3044 3b20 4341 5252 4941 4745 2052  000D; CARRIAGE R
00000020: 4554 5552 4e20 2843 5229 3b20 0d0a 0d0a  ETURN (CR); ....
00000030: 626e 6c6e 6f74 3b20 6368 6172 3b20 3030  bnlnot; char; 00
00000040: 4130 3b20 4e4f 2d42 5245 414b 2053 5041  A0; NO-BREAK SPA
00000050: 4345 3b20 c2a0 0d0a                      CE; ....

Notice the 0d0a 0d0a sequence. If I understand the file format correctly, this should actually be 0d0d 0a: first the actual character the line describes, then the MSDOS line ending. The next line follows that pattern and ends with c2a0 0d0a (nbsp followed by the MSDOS line ending).

Presumably, some line ending conversion logic added 0a after the 0d. This also happened the other way around: the Line Feed (LF) line also ends with 0d0a 0d0a, although it is supposed to end with 0a0d 0a.

curl -s https://raw.githubusercontent.com/String-Latin/DIN-91379-Characters-and-Sequences/333031c992cc97b518f32e027fcce3c652e334c3/latin_list_1.3.txt | head -n 738 | tail -n 3 | xxd

00000000: 0a62 6e6c 6e6f 743b 2063 6861 723b 2030  .bnlnot; char; 0
00000010: 3030 443b 2043 4152 5249 4147 4520 5245  00D; CARRIAGE RE
00000020: 5455 524e 2028 4352 293b 200a 626e 6c6e  TURN (CR); .bnln
00000030: 6f74 3b20 6368 6172 3b20 3030 4130 3b20  ot; char; 00A0;
00000040: 4e4f 2d42 5245 414b 2053 5041 4345 3b20  NO-BREAK SPACE;
00000050: c2a0 0a                                  ...

Here there's no 0d at all, just the 0a Unix line ending. Presumably, line ending conversion dropped it.
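
The check I did by eye on the hex dumps could be sketched roughly like this. It is only a sketch: the function name is made up, and it assumes each entry has five "; "-separated fields (group; type; hex code point; name; literal character) with a single code point, which is what the two entries above look like. The line terminator has to be passed in explicitly, because an entry ending in 0d 0a is ambiguous without knowing whether the file uses MSDOS or Unix line endings.

```python
def literal_matches_code_point(raw_line: bytes, eol: bytes) -> bool:
    # Remove exactly one line terminator so that a literal CR or LF
    # in the last field is not stripped along with it.
    body = raw_line[: -len(eol)] if raw_line.endswith(eol) else raw_line

    # Assumed layout: group; type; code point (hex); name; literal character
    fields = body.split(b"; ", 4)
    if len(fields) != 5:
        return False

    # The last field should be the UTF-8 encoding of the listed code point.
    expected = chr(int(fields[2].decode("ascii"), 16)).encode("utf-8")
    return fields[4] == expected


cr_entry = b"bnlnot; char; 000D; CARRIAGE RETURN (CR); "

# Bytes as found in latin_list_1.2.txt (MSDOS endings): literal CR is gone.
print(literal_matches_code_point(cr_entry + b"\r\n", b"\r\n"))
# Bytes as found in latin_list_1.3.txt (Unix endings): literal CR is gone.
print(literal_matches_code_point(cr_entry + b"\n", b"\n"))
# What I would expect with MSDOS endings: 0d 0d 0a.
print(literal_matches_code_point(cr_entry + b"\r\r\n", b"\r\n"))
```

With Unix endings the intact entry would end in 0d 0a, which is byte-for-byte what the damaged 1.2 entry looks like, so the result depends entirely on the `eol` argument.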

Of course I don't know where the line ending conversion took place. I've opened a PR at https://github.com/String-Latin/DIN-91379-Characters-and-Sequences/pull/3.

vk-github18 commented 6 months ago

In which use case does this matter? What is the benefit of the proposed change?

johnp commented 6 months ago

I'm currently working on a way to automatically derive DIN 91379 validators (at least regular expressions) from the raw set of characters. Since there's no other freely available source for the DIN 91379 characters and metadata, I'm parsing the data from your repository. While doing so, I noticed the described issue.

The benefit is having a correct, open-source data source for DIN 91379 characters. The only alternative I saw was parsing the Wikipedia tables, which would probably have been quite horrible.

vk-github18 commented 6 months ago

For this use case you can use the code point and ignore the shown character. Maybe the characters of bnlnot should not be included in the files at all. As another option, your employer could purchase the DIN; then you can work with the provided XML file.

vk-github18 commented 6 months ago

Regular expressions can be found in https://xoev.de/schemata/din/91379/2022-08/din-norm-91379-datatypes.xsd

johnp commented 6 months ago

Good point, I could just use the code points.
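
Something along these lines should work, as a rough sketch rather than my actual code. It assumes the hex code point is the third ";"-separated field of each single-character entry (as in the two lines quoted above) and ignores the literal character completely; both function names are made up for illustration.

```python
import re


def code_point_from_entry(line: str) -> int:
    # Assumed layout: group; type; code point (hex); name; literal character.
    # Only the third field is read, so a damaged literal does not matter.
    return int(line.split(";")[2].strip(), 16)


def character_class(code_points) -> str:
    # Build a regex character class, merging consecutive code points into ranges.
    cps = sorted(set(code_points))
    if not cps:
        raise ValueError("need at least one code point")

    def esc(cp: int) -> str:
        return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"

    runs = [[cps[0], cps[0]]]
    for cp in cps[1:]:
        if cp == runs[-1][1] + 1:
            runs[-1][1] = cp          # extend the current run
        else:
            runs.append([cp, cp])     # start a new run
    parts = (esc(a) if a == b else f"{esc(a)}-{esc(b)}" for a, b in runs)
    return "[" + "".join(parts) + "]"


entries = [
    "bnlnot; char; 000D; CARRIAGE RETURN (CR); ",
    "bnlnot; char; 00A0; NO-BREAK SPACE; \u00a0",
]
pattern = re.compile(character_class(code_point_from_entry(e) for e in entries) + "*")
print(bool(pattern.fullmatch("\r\u00a0")))
```

A flat character class like this only covers single code points; entries that describe multi-code-point sequences would need separate handling.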

I know the regular expressions that KoSIT created. They have ~severe~ performance issues and common regex engines (like the one in Java) run into stack overflows at, for Java, around 2000 characters. Since the KoSIT regexes are licensed under "Creative Commons Namensnennung - Keine Bearbeitung 4.0 International", changing them (edit: and then distributing those changes) is not allowed. That's why I'd like to base my work on another source, which isn't tainted by restrictive licensing (which I guess the DIN XML-file also has).

edit: strike the severe; it's afaict only affecting regex engines which use recursion, here's a decent article about it.

johnp commented 6 months ago

Feel free to close this issue & the PR if you prefer to keep the files as they are. I'll change my code to use the code points instead.

vk-github18 commented 6 months ago

I don't think it is worth the effort to try to get the invisible characters right. In a future version of the files I will replace them with REPLACEMENT CHARACTER U+FFFD. I added a hint that the code points should be used.
