Closed mateuszviste closed 9 months ago
I think Int 21h/AX=6506h is your friend here, although I have never used this function myself.
This sounds right. An example of such a collating table for CP 850 can be found here: https://github.com/SvarDOS/edrdos/blob/9751c114b84df883956fa289a0142dfe54b57854/drdos/country.asm#L2037
It is interesting to see that this seems to be a case-insensitive ordering.
While the FreeDOS one is case-sensitive: https://github.com/FDOS/country/blob/a170a5508430cd861754b9064d7e1a081d8b3101/country.asm#L3768
Thanks for your input, Bernd. I will look into implementing this in SvarCOM soon; it's relatively easy and I'm halfway there already. Worth noting that for FreeDOS the collation does not matter much, since FreeCOM does not support NLS sorting anyway :-P (it uses a simple strcmp() call)
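To illustrate the difference between a plain strcmp() and a table-driven comparison, here is a minimal sketch in portable C. The name collate_cmp and the way the table is passed in are assumptions made for illustration; this is not the actual SvarCOM or FreeCOM code.

```c
#include <stddef.h>

/* Hypothetical helper: compare two NUL-terminated strings through a
 * 256-entry collation table. Every byte is replaced by its weight
 * before being compared. Returns <0, 0 or >0, like strcmp(). */
int collate_cmp(const unsigned char tbl[256], const char *a, const char *b) {
  const unsigned char *pa = (const unsigned char *)a;
  const unsigned char *pb = (const unsigned char *)b;
  for (;; pa++, pb++) {
    /* end of either string: the shorter one sorts first */
    if (*pa == 0 || *pb == 0) return (int)(*pa != 0) - (int)(*pb != 0);
    /* different weights decide the order */
    if (tbl[*pa] != tbl[*pb]) return (int)tbl[*pa] - (int)tbl[*pb];
  }
}
```

With a table that gives "u", "U", "ü" and "Ü" the same weight, this function returns 0 for any pair of them, whereas strcmp() would order them by their raw byte values.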
Another, more annoying subject is that localcfg has no support for this collation business.
Committed in r1743
It seems to work, but I haven't tested it very much to be honest, as I do not use COUNTRY.SYS myself. @bttrx any chance you could check if this works alright on your setup with all these weird German letters of yours?
I tested on MS-DOS 6.22 + 'trunk' SvarCOM. Results are a bit strange.
At first, I created dirs ä, a, ö, o, u, ü, ß, s using MS COMMAND.COM.
COUNTRY=049,850,C:\DOS\COUNTRY.SYS
+ MS COMMAND.COM.
DIR /O:N order: Ä, A, Ö, O, ß, S, U, Ü
Why does Ü come after U, but Ä before A? I created these in a different order.
In a new dir I did:
md a, md ä, dir /on -> aä
md ä, md a, dir /on -> äa
md u, md ü, dir /on -> uü
md ü, md u, dir /on -> üu
COUNTRY=049,850,C:\DOS\COUNTRY.SYS
+ SvarCOM.
DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U
In another new dir I did:
md a, md ä, dir /on -> aä
md ä, md a, dir /on -> äa
md u, md ü, dir /on -> üu (!)
md ü, md u, dir /on -> uü (!)
No COUNTRY line (= EN-US) + MS COMMAND.COM:
DIR /O:N order: A, O, S, U, Ä, Ö, Ü, ß
This is the expected order for EN-US.
No COUNTRY line (= EN-US) + SvarCOM:
DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U
That's unexpected, because it's the COUNTRY=049 order!
COUNTRY=001,437,C:\DOS\COUNTRY.SYS
(= EN-US) + SvarCOM:
DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U
That's unexpected again, because it's the COUNTRY=049 order!
md s ß u ü o ö a ä
DIR /O:N order: A, Ä, Ö, O, S, ß, Ü, U (?)
removed all dirs and created in the same order -> same result
Switched to MS COMMAND.COM
DIR /O:N order: A, O, S, U, Ä, Ö, Ü, ß
This is again the expected order for EN-US.
Why does Ü come after U, but Ä before A? I created these in a different order.
A quick theory: is this because Ü and U have the same weight in your country.sys table? In the same manner, Ä might have the same weight as A. In that case, the order between these two is random, and it is the letters that come after them that decide the order of the files.
more importantly: do you have different results with MS command.com ?
No COUNTRY line (= EN-US) + SvarCOM: DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U That's unexpected, because it's the COUNTRY=049 order!
That is unlikely, really. Are you sure you performed reboots between each of your tests? There is no way SvarCOM could invent the proper order. All it does is ask the kernel for "current country/codepage sorting order". Unless you test on some German version of MSDOS, which comes with the default collation set to German?
If your results are reproducible, could you maybe provide me with a boot floppy that has your exact NLS environment? I could then have a closer look at what happens exactly.
Now that I think of it, the behavior you describe does make sense to me. Independently of the "country" (1, 49, 33, or any other), since the currently selected codepage is able to display "Ü" I'd expect it to be always sorted like "U".
This is to say that maybe the "country" does not mean anything, the collating table is probably tied only to the codepage.
If you'd be keen on doing more tests, I think you could try replacing mov dx, 0xffff from r1743 by an actual country value (1, 49...) and see if it changes anything. 0xffff is supposed to mean "current country", but maybe there is some different behavior if it is given explicitly.
In any case, I like the behavior you describe more than having the "en-US" sort being stupid about European glyphs. :)
Why does Ü come after U, but Ä before A? I created these in a different order.
a quick theory: is this because Ü and U have the same weight in your country.sys table?
Dunno. Didn't have a look at the table so far and I'm also new to collation at all.
In the same manner, Ä might have the same weight as A. In that case, the order between these two is random, and it is the letters that come after them that decide the order of the files.
Is it really random or does it depend on the order of creation on disk?
more importantly: do you have different results with MS command.com ?
Do you mean any randomness in the order? No, didn't notice any randomness.
No COUNTRY line (= EN-US) + SvarCOM: DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U That's unexpected, because it's the COUNTRY=049 order!
That is unlikely, really. Are you sure you performed reboots between each of your tests?
Yes.
There is no way SvarCOM could invent the proper order. All it does is ask the kernel for "current country's sorting order". Unless you test on some German version of MSDOS, which comes with the default collation set to German?
I tested all this on a German version of MS-DOS, but why would it work correctly then with MS COMMAND.COM?
Now, I repeated one of those tests on an English version of MS-DOS 6.22. Same result.
MS COMMAND.COM dir /on
-> AOUÄÖU
'trunk' SvarCOM dir /on
-> ÄAÖOÜU
CONFIG.SYS:
[MENU]
MENUITEM=MSCOM,MS COMMAND.COM
MENUITEM=SVARCOM,SvarCOM
[MSCOM]
SHELL=C:\COMMAND.COM /P
[SVARCOM]
SHELL=C:\SVARCOM.COM /E:512 /P
[COMMON]
SWITCHES=/F
DEVICE=C:\DOS\SETVER.EXE
DEVICE=C:\DOS\HIMEM.SYS /TESTMEM:OFF /V
DOS=HIGH
FILES=30
AUTOEXEC.BAT:
C:\DOS\SMARTDRV.EXE /X
@ECHO OFF
PROMPT $p$g
PATH C:\DOS
SET TEMP=C:\DOS
EIDL.COM
I think you could try replacing mov dx, 0xffff from r1743 by an actual country value (1, 49...) and see if it changes anything.
No change. Also no change after replacing mov bx, 0xffff with mov bx, 437.
Is it really random or does it depend on the order of creation on disk?
It is sorted via quicksort, so the entries are shuffled around quite a bit. I'm not sure the on-disk order is always preserved when two names have the same weight, so I'd rather say "undefined behavior".
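One way to make the result deterministic when two names carry identical weights is to fall back to a raw byte comparison as a tie-breaker. The sketch below assumes a global table g_collate and a comparator cmp_names passed to the standard qsort(); all names are hypothetical, and neither SvarCOM nor MS COMMAND.COM is known to do this.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical global: the active 256-entry collation table. */
unsigned char g_collate[256];

/* qsort() comparator for an array of (const char *) names. */
int cmp_names(const void *x, const void *y) {
  const unsigned char *a = (const unsigned char *)*(const char *const *)x;
  const unsigned char *b = (const unsigned char *)*(const char *const *)y;
  size_t i;
  /* first pass: compare by collation weight */
  for (i = 0; a[i] != 0 && b[i] != 0; i++) {
    if (g_collate[a[i]] != g_collate[b[i]])
      return (int)g_collate[a[i]] - (int)g_collate[b[i]];
  }
  /* one name is a prefix of the other: the shorter one sorts first */
  if (a[i] != b[i]) return (int)(a[i] != 0) - (int)(b[i] != 0);
  /* all weights equal: break the tie on raw byte values */
  return strcmp((const char *)a, (const char *)b);
}

void sort_names(const char **names, size_t count) {
  qsort(names, count, sizeof(names[0]), cmp_names);
}
```

With this tie-breaker, U and Ü always come out in the same relative order regardless of the order in which the directories were created.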
No change. Also no change after replacing mov bx, 0xffff with mov bx, 437.
Well, there isn't much more I could do then... I suppose this could be due to some hardcoded rule like "if country=1 then do not bother with NLS and just rely on fast ASCII order".
I do not see a problem having the sort rely on NLS all the time (as long as NLS is available, that is), and at least it makes for a consistent sorting experience across languages.
Unless you have some other ideas, I will check later today that the NLS sorting behaves well also in Polish and Russian and call this a feature.
I will check later today that the NLS sorting behaves well also in Polish and Russian and call this a feature.
I've set up an MS-DOS 6.0 VM (had to borrow the COUNTRY.SYS and EGA3.CPI from MS-DOS 6.22, though) and tested the collate sort order for CP852 and CP866: both behave the same with SvarCOM and MS COMMAND.COM when the COUNTRY is set to 048 and 007, respectively. For example:
All good.
But when the COUNTRY is NOT set, then things go south. MS COMMAND.COM orders files according to ASCII, which is not linguistically correct but fair enough given the circumstances:
SvarCOM, on the other hand, lists files in an order that makes no sense:
The above order is not ASCII, not alphabetic, and it's also not the order of files on disk. It does not seem to be a SvarCOM bug, because SvarCOM really does receive such a collate table from the kernel, and the INT21h/AX=6506h call does not fail (CF is clear). Weird.
It is interesting to note that this order is the same for both PL and RU codepages. Noticing this, I changed my configuration and set COUNTRY=001,437,... And guess what: the order is still exactly the same!
So my working theory (speculation) is that when COUNTRY is not set or set to its default value (1), then the kernel falls back to a collate table designed for CP437. I do not know the rationale for this behavior: maybe there is a reason for it, or maybe it is a bug. Whatever the cause, it appears that NLS sorting should be disabled for "COUNTRY is 001" after all.
r1744 performs NLS sorting only when COUNTRY > 1.
This, I think, mimics what MS COMMAND does, and also avoids ending up with a wild sort order for non-437 languages when COUNTRY is not configured (because when COUNTRY is not configured, the kernel assumes COUNTRY=1 and proposes a CP437 collate).
I am not entirely convinced this is a good approach, because after all a missing COUNTRY is a configuration error that the user should fix, and besides - I really liked the elegant CP437 sorting being applied to U.S.... but if in doubt, it is probably safer to monkey whatever MS did 40 years ago.
@bttrx This should make the sort order work as you initially expected. Do you confirm?
Interesting findings! Have you tried checking the table size for being exactly 256? Currently there is a <= 256. Maybe the table contains simply "uninitialized" garbage. I am currently also on this topic but from an EDR kernel perspective. For EDR the case-insensitive standard collation is set by default even without a COUNTRY line in CONFIG.SYS. Would be interesting to see which table the MS-DOS kernel returns in the "default" case.
r1744 performs NLS sorting only when COUNTRY > 1
There may be a combination of country=1 and code page=850. The EDR country.sys contains this combination. In this case the collating table is that of CP 850.
Have you tried checking the table size for being exactly 256?
Yes I did, the kernel always advertises the table as 256 bytes. But even if it was less, it would be no issue because then SvarCOM relies on ASCII sorting for whatever is not covered by the collate table.
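The fallback described here can be sketched as follows. The layout assumed below (a 16-bit little-endian length followed by that many weight bytes) is how the collating table is commonly documented for INT 21h/AX=6506h, but treat it as an assumption; load_collate is a hypothetical name, not the SvarCOM function.

```c
#include <stddef.h>

/* Fill out[256] from a raw table blob: a 2-byte little-endian length,
 * followed by that many weights. Characters beyond the advertised
 * length keep their own code as weight, i.e. plain ASCII order. */
void load_collate(unsigned char out[256], const unsigned char *blob) {
  unsigned int len = (unsigned int)blob[0] | ((unsigned int)blob[1] << 8);
  unsigned int i;
  if (len > 256) len = 256; /* defensive clamp (rejecting the table is another option) */
  for (i = 0; i < 256; i++) {
    out[i] = (i < len) ? blob[2 + i] : (unsigned char)i;
  }
}
```

A table shorter than 256 entries therefore only overrides the characters it actually covers.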
Would be interesting to see which table the MS-DOS kernel returns in the "default" case.
It is basically a "common sense CP437" sorting that is case-insensitive, for example i = I = ï = î = ì = í. But here it is, I dumped it for you :)
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015
016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031
032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047
048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063
064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095
096 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
080 081 082 083 084 085 086 087 088 089 090 123 124 125 126 127
067 085 069 065 065 065 065 067 069 069 069 073 073 073 065 065
069 065 065 079 079 079 085 085 089 079 085 036 036 036 036 036
065 073 079 085 078 078 166 167 063 169 170 171 172 033 034 034
176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
224 083 226 227 228 229 230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
There may be a combination of country=1 and code page=850. The EDR country.sys contains this combination. In this case the collating table is that of CP 850.
The issue here is that MS-DOS returns "something" (the above collate table) for combinations that do not exist, like country=1 and page=866, which makes it difficult to trust anything when country is 1, as it's a default value...
But here it is, I dumped it for you :)
Thanks :-) Looks indeed like a valid collating table. For reference, these are the FreeDOS country.sys values for 437. I am bad at comparing, but it looks like the tables are equal. (posted the FreeDOS one because the EDR one is in hex :-P)
db 0, 1, 2, 3, 4, 5, 6, 7
db 8, 9, 10, 11, 12, 13, 14, 15
db 16, 17, 18, 19, 20, 21, 22, 23
db 24, 25, 26, 27, 28, 29, 30, 31
db 32, 33, 34, 35, 36, 37, 38, 39
db 40, 41, 42, 43, 44, 45, 46, 47
db 48, 49, 50, 51, 52, 53, 54, 55
db 56, 57, 58, 59, 60, 61, 62, 63
db 64, 65, 66, 67, 68, 69, 70, 71
db 72, 73, 74, 75, 76, 77, 78, 79
db 80, 81, 82, 83, 84, 85, 86, 87
db 88, 89, 90, 91, 92, 93, 94, 95
db 96, 65, 66, 67, 68, 69, 70, 71
db 72, 73, 74, 75, 76, 77, 78, 79
db 80, 81, 82, 83, 84, 85, 86, 87
db 88, 89, 90, 123, 124, 125, 126, 127
db 67, 85, 69, 65, 65, 65, 65, 67
db 69, 69, 69, 73, 73, 73, 65, 65
db 69, 65, 65, 79, 79, 79, 85, 85
db 89, 79, 85, 36, 36, 36, 36, 36
db 65, 73, 79, 85, 78, 78, 166, 167
db 63, 169, 170, 171, 172, 33, 34, 34
db 176, 177, 178, 179, 180, 181, 182, 183
db 184, 185, 186, 187, 188, 189, 190, 191
db 192, 193, 194, 195, 196, 197, 198, 199
db 200, 201, 202, 203, 204, 205, 206, 207
db 208, 209, 210, 211, 212, 213, 214, 215
db 216, 217, 218, 219, 220, 221, 222, 223
db 224, 83, 226, 227, 228, 229, 230, 231
db 232, 233, 234, 235, 236, 237, 238, 239
db 240, 241, 242, 243, 244, 245, 246, 247
db 248, 249, 250, 251, 252, 253, 254, 255
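Since comparing two 256-entry tables by eye is error-prone, a small helper can do it mechanically. This is a generic sketch: first_table_diff is a hypothetical name, and the two tables would have to be pasted in as C arrays first.

```c
/* Return the index of the first entry where the two collation tables
 * differ, or -1 if all 256 entries are identical. */
int first_table_diff(const unsigned char a[256], const unsigned char b[256]) {
  int i;
  for (i = 0; i < 256; i++) {
    if (a[i] != b[i]) return i;
  }
  return -1;
}
```

Calling it in a loop, restarting after each reported index, would list every differing entry.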
and this is what MS-DOS returns for COUNTRY=1 / CP=850. (indeed, a different set)
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015
016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031
032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047
048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063
064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095
096 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
080 081 082 083 084 085 086 087 088 089 090 123 124 125 126 127
067 085 069 065 065 065 065 067 069 069 069 073 073 073 065 065
069 065 065 079 079 079 085 085 089 079 085 079 036 079 158 036
065 073 079 085 078 078 166 167 063 169 170 171 172 033 034 034
176 177 178 179 180 065 065 065 184 185 186 187 188 036 036 191
192 193 194 195 196 197 065 065 200 201 202 203 204 205 206 036
068 068 069 069 069 073 073 073 073 217 218 219 220 221 073 223
079 083 079 079 079 079 230 232 232 085 085 085 089 089 238 239
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
But again, this table is NOT used by command.com for the combination COUNTRY=1 / CP=850. Instead, ASCII sort is applied (just like for the combination COUNTRY=1 / CP=437). But as soon as I switch to COUNTRY=33 / CP=850, the above sort table is not only proposed by the kernel, but also applied by command.com.
So while the kernel seems to have some options for nice COUNTRY=1 sorting, MS COMMAND.COM prefers ignoring them.
Do you want SvarCOM to be bug-for-bug compatible? :-D
MS COMMAND.COM prefers ignoring them
As an additional data point: 4DOS does not seem to respect country and code page at all for sorting. I just tried it with my current SvarDOS install. It also outputs its listing in lowercase by default, which fails on German umlauts :-)
So while the kernel seem to have some options for nice COUNTRY=1 sorting, MS COMMAND.COM prefers ignoring them.
Do you want SvarCOM to be bug-for-bug compatible? :-D
No, and this is why at first I was happy to keep NLS sorting for US codepages, despite Robert's complaints. :) But then I made more tests and I realized that the (MS) kernel does not fail the INT21h/AX=6506h call when it does not have a proper collate table, and instead provides this "CP437" table as a fallback. Combined with the fact that COUNTRY=1 is used both for US and for "country unknown" situations, it is very easy to end up with a totally messed up sorting. Which is probably (I assume) the reason that MS COMMAND prefers to flatly ignore anything with country=1...
I'm not sure what to do on this, and would need to make more tests to compare how it works with the FreeDOS and EDR kernels. But for now, having no certainty I preferred to opt for following MS's cautious choice so I can push SvarCOM 2024.2 out. Then there will always be time to reconsider options.
4DOS does not seem to respect country and code page at all for sorting. Just tried it with my current SvarDOS
SvarDOS might not be a good test candidate, as it comes with a very limited COUNTRY.SYS, with no collation tables and no upcase tables. Maybe that's the reason 4DOS fails on the umlauts?
It is SvarDOS using EDR and its COUNTRY.SYS I am running. I have not looked into the 4DOS source yet. But my assumption is that it simply does not make use of the INT21,65xx functions (at least for sorting).
Regarding the conversion to lower case, which leads to something like abÄd.txt in 4DOS dir output, I think it simply does the non NLS-enabled standard case conversion.
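A non-NLS lowercase conversion of this kind would look roughly like the sketch below. It only illustrates the assumed behavior and is not code taken from 4DOS; ascii_lower is a hypothetical name. Bytes above 127, such as Ä (0x8E in CP437 and CP850), pass through unchanged, which yields names like abÄd.txt.

```c
/* Lowercase a string in place, touching only the ASCII letters A-Z.
 * Any byte outside that range (including all codepage-specific
 * letters above 127) is left as it is. */
void ascii_lower(char *s) {
  for (; *s != 0; s++) {
    if (*s >= 'A' && *s <= 'Z') *s = (char)(*s - 'A' + 'a');
  }
}
```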
I noticed that the EDR country.sys has upcase conversion tables but no lower case tables. MS-DOS country.sys seems to have some lower case tables since 6.22 according to RBIL, but incomplete. This makes conversion to lower case harder than conversion to upper case, I think. Perhaps one can convert the upper case table to a lower case table? Should be possible if the mapping is bijective.
I'm not sure what to do on this, and would need to make more tests to compare how it works with the FreeDOS and EDR kernels. But for now, having no certainty I preferred to opt for following MS's cautious choice so I can push SvarCOM 2024.2 out. Then there will always be time to reconsider options.
Better play safe 👍
Perhaps one can convert the upper case table to a lower case table? Should be possible if the mapping is bijective.
It is not, because due to the space limitations of a single codepage, not all glyphs are available in both upper and lower case. For example, in CP437 there is the French "è" but not its upcase version, so the upcase conversion is "è -> E". The same situation happens with many other glyphs.
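Whether a given upcase table can be inverted is easy to check mechanically: the inversion becomes ambiguous as soon as two different characters are converted to the same uppercase character. A minimal sketch, with upcase_collisions as a hypothetical helper:

```c
/* Count the characters whose upcase target is shared with at least one
 * other converted character. A result of 0 means the conversion can be
 * inverted without ambiguity. */
int upcase_collisions(const unsigned char up[256]) {
  int hits[256] = {0};
  int i, n = 0;
  for (i = 0; i < 256; i++) {
    if (up[i] != i) hits[up[i]]++; /* i is converted to another character */
  }
  for (i = 0; i < 256; i++) {
    if (hits[i] > 1) n += hits[i]; /* several sources share this target */
  }
  return n;
}
```

For a table that maps both "e" and "è" to "E", the function reports 2, so a lowercase table cannot be derived from it without extra rules.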
Checking for COUNTRY=1 and ignoring NLS sorting is a no-go after all, because the FreeDOS kernel returns an error "invalid function number" to the call INT 21h/AX=6501h (and that's the call I need to discover the current COUNTRY).
Hence the "if country==1 then ignore NLS" hack is not only ugly, but not possible anyway with SvarDOS' current default kernel. I will therefore remove this hack and we will have to live with the fact that DIR collation will be very weird for users that set a non-437 codepage but forget to set a proper COUNTRY setting.
PS. when compiled with "-DDIR_DUMPNLSCOLLATE", SvarCOM will show a dump of the NLS collate table on screen, on top of every DIR output. It is one line to uncomment in the makefile.
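For reference, a dump in the same layout as the tables posted earlier in this thread (16 rows of 16 three-digit decimal values) could be produced along these lines. This is a sketch only; format_collate_row is a hypothetical name and not necessarily how the DIR_DUMPNLSCOLLATE code is written.

```c
#include <stdio.h>

/* Format row `row` (0..15) of a collation table as 16 space-separated,
 * zero-padded decimal values. out must hold at least 64 bytes
 * (16 * 3 digits + 15 spaces + terminating NUL). */
void format_collate_row(const unsigned char tbl[256], int row, char out[64]) {
  int col, pos = 0;
  for (col = 0; col < 16; col++) {
    pos += sprintf(out + pos, (col == 0) ? "%03u" : " %03u",
                   (unsigned int)tbl[row * 16 + col]);
  }
}
```

Printing the 16 rows one after another with puts() gives the full table.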
Closing this. For the time being I do not see any better approach than applying NLS sorting unconditionally. I believe it is the most elegant solution, even though it differs from MSDOS' behavior.
This follows #11
@bttrx writes: