Closed johnp closed 6 months ago
The lines look different because latin_list_1.2.txt has MSDOS line endings, whereas latin_list_1.3.txt has Unix line endings. The character 0D is not missing.
I did indeed miss the difference in line endings. It still appears to me that the 0d byte is entirely missing in latin_list_1.3.txt, and that it has apparently been converted to 0d0a in latin_list_1.2.txt. I've verified this by downloading the files directly to avoid any git autocrlf conversion on my side.
curl -s https://raw.githubusercontent.com/String-Latin/DIN-91379-Characters-and-Sequences/333031c992cc97b518f32e027fcce3c652e334c3/latin_list_1.2.txt | head -n 736 | tail -n 4 | xxd
00000000: 0d0a 626e 6c6e 6f74 3b20 6368 6172 3b20  ..bnlnot; char; 
00000010: 3030 3044 3b20 4341 5252 4941 4745 2052  000D; CARRIAGE R
00000020: 4554 5552 4e20 2843 5229 3b20 0d0a 0d0a  ETURN (CR); ....
00000030: 626e 6c6e 6f74 3b20 6368 6172 3b20 3030  bnlnot; char; 00
00000040: 4130 3b20 4e4f 2d42 5245 414b 2053 5041  A0; NO-BREAK SPA
00000050: 4345 3b20 c2a0 0d0a                      CE; ....
Notice the 0d0a 0d0a sequence. If I understand the file format correctly, this should actually be 0d0d 0a: first the actual character the line describes, then the MSDOS line ending, just like the next line, which ends with c2a0 0d0a (nbsp followed by the MSDOS line ending).
Presumably, some line ending conversion logic added 0a after the 0d. This also happened the other way around: the Line Feed (LF) line also ends with 0d0a 0d0a, although it is supposed to end with 0a0d 0a.
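To illustrate the suspected conversion, here is a minimal Python sketch. It assumes a hypothetical normalizer that treats every CRLF, bare CR, or bare LF as a line break and rewrites it as CRLF; I don't know which tool actually touched the file, so `normalize_to_crlf` is only a guess at its behaviour. The LF sample line is likewise assumed to follow the same layout as the CR line shown in the dump.

```python
import re

# Hypothetical normalizer: every CRLF, bare CR, or bare LF counts as a line
# break and is rewritten as CRLF. This is an assumption about what happened,
# not a confirmed description of the tool that produced the file.
_ANY_NEWLINE = re.compile(rb"\r\n|\r|\n")


def normalize_to_crlf(data: bytes) -> bytes:
    return _ANY_NEWLINE.sub(b"\r\n", data)


# Expected endings: the payload byte followed by an MSDOS line ending.
cr_line = b"bnlnot; char; 000D; CARRIAGE RETURN (CR); \r\r\n"
# Assumed layout for the LF entry, modelled on the CR entry above.
lf_line = b"bnlnot; char; 000A; LINE FEED (LF); \n\r\n"

# Show the last four bytes of each line after normalization.
print(normalize_to_crlf(cr_line)[-4:].hex())
print(normalize_to_crlf(lf_line)[-4:].hex())
```

Under this assumption both payload bytes are themselves taken for line breaks, so both lines end in the same four bytes and the CR and LF entries can no longer be told apart by their content.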
curl -s https://raw.githubusercontent.com/String-Latin/DIN-91379-Characters-and-Sequences/333031c992cc97b518f32e027fcce3c652e334c3/latin_list_1.3.txt | head -n 738 | tail -n 3 | xxd
00000000: 0a62 6e6c 6e6f 743b 2063 6861 723b 2030  .bnlnot; char; 0
00000010: 3030 443b 2043 4152 5249 4147 4520 5245  00D; CARRIAGE RE
00000020: 5455 524e 2028 4352 293b 200a 626e 6c6e  TURN (CR); .bnln
00000030: 6f74 3b20 6368 6172 3b20 3030 4130 3b20  ot; char; 00A0; 
00000040: 4e4f 2d42 5245 414b 2053 5041 4345 3b20  NO-BREAK SPACE; 
00000050: c2a0 0a                                  ...
Here there's no 0d at all, just the 0a Unix line ending. Presumably, line ending conversion dropped it.
Of course I don't know where the line ending conversion took place. I've opened a PR at https://github.com/String-Latin/DIN-91379-Characters-and-Sequences/pull/3.
In which use case does this matter? What is the benefit of the proposed change?
I'm currently working on a way to automatically derive DIN 91379 validators (at least regular expressions) from the raw set of characters. Since there's no other freely available source for the DIN 91379 characters and metadata, I'm parsing the data from your repository. While doing so, I noticed the described issue.
The benefit is having a correct, open-source data source for DIN 91379 characters. The only alternative I saw was parsing the Wikipedia tables, which would probably have been quite horrible.
For this use case you can use the code point and ignore the shown character. Maybe the characters of bnlnot should not be included in the files at all. As another option, your employer could purchase the DIN standard; then you can work with the provided XML file.
Regular expressions can be found in https://xoev.de/schemata/din/91379/2022-08/din-norm-91379-datatypes.xsd
Good point, I could just use the code points.
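A minimal sketch of what reading only the code point could look like, in Python. The field layout (group; kind; code point; name; character) is inferred from the two lines visible in the hex dumps above, and the names `Entry` and `parse_entry` are my own. It only handles lines with a single hexadecimal code point; lines describing sequences would need extra handling that I haven't checked against the files.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    group: str      # e.g. "bnlnot"
    kind: str       # e.g. "char"
    codepoint: int  # parsed from the hex field
    name: str       # e.g. "NO-BREAK SPACE"


def parse_entry(line: str) -> Entry:
    # Only the first four fields are read. The trailing field with the
    # literal character is ignored, since it may have been altered by
    # line ending conversion.
    fields = line.split(";", 4)
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    group, kind, cp_hex, name = (f.strip() for f in fields[:4])
    # Assumption: the third field holds exactly one hex code point.
    return Entry(group, kind, int(cp_hex, 16), name)


entry = parse_entry("bnlnot; char; 000D; CARRIAGE RETURN (CR); ")
character = chr(entry.codepoint)  # rebuilt from the code point, not the file
```

This way the CR and LF entries come out correctly regardless of which line endings the downloaded file happens to have.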
I know the regular expressions that KoSIT created. They have ~severe~ performance issues and common regex engines (like the one in Java) run into stack overflows at, for Java, around 2000 characters. Since the KoSIT regexes are licensed under "Creative Commons Namensnennung - Keine Bearbeitung 4.0 International", changing them (edit: and then distributing those changes) is not allowed. That's why I'd like to base my work on another source, which isn't tainted by restrictive licensing (which I guess the DIN XML-file also has).
edit: strike the severe; it's afaict only affecting regex engines which use recursion, here's a decent article about it.
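For what it's worth, this is roughly the direction I have in mind: a sketch in Python that collapses a set of code points into ranges and emits one flat character class. The helper names `to_ranges` and `char_class` are made up for this example, the `\UXXXXXXXX` escape is Python syntax (other engines spell it differently), and it only covers single code points, not the sequences DIN 91379 also defines. I haven't tested whether a pattern like this avoids the stack overflows in recursion-based engines.

```python
import re


def to_ranges(codepoints):
    """Collapse code points into sorted, inclusive (low, high) ranges."""
    ranges = []
    for cp in sorted(set(codepoints)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [(lo, hi) for lo, hi in ranges]


def _esc(cp):
    # Escape every code point numerically so that control characters and
    # regex metacharacters need no special casing.
    return f"\\U{cp:08X}"


def char_class(codepoints):
    """Build a flat regex character class from single code points."""
    ranges = to_ranges(codepoints)
    if not ranges:
        raise ValueError("need at least one code point")
    parts = [_esc(lo) if lo == hi else f"{_esc(lo)}-{_esc(hi)}"
             for lo, hi in ranges]
    return "[" + "".join(parts) + "]"


# Illustrative input only, not the real DIN 91379 character set.
validator = re.compile(char_class([0x0D, 0x41, 0x42, 0x43, 0xA0]) + "*")
```

With the sample input, `validator.fullmatch("ABC\r\u00a0")` matches and `validator.fullmatch("ABD")` returns `None`.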
Feel free to close this issue & the PR if you prefer to keep the files as they are. I'll change my code to use the code points instead.
I don't think it is worth the effort to try to get the invisible characters right. In a future version of the files I will replace them with REPLACEMENT CHARACTER U+FFFD. I added a hint that the code points should be used.
Hi,
in latin_list_1.3.txt the carriage return line is missing the carriage return (0D byte) at the end of the line: https://github.com/String-Latin/DIN-91379-Characters-and-Sequences/blob/333031c992cc97b518f32e027fcce3c652e334c3/latin_list_1.3.txt#L737
latin_list_1.2.txt correctly contains the 0D byte at that location: https://github.com/String-Latin/DIN-91379-Characters-and-Sequences/blob/333031c992cc97b518f32e027fcce3c652e334c3/latin_list_1.2.txt#L734
Maybe git's autocrlf feature dropped it (autocrlf should maybe be disabled for the .txt files in this repo via .gitattributes?).