Open PenguinKeeper7 opened 2 years ago
@PenguinKeeper7's example shows that he sets LC_ALL=C for his run of sort.
@hops could I impose upon you to look at this one briefly, as time allows? I'm not clear about what the root cause is.
This does appear to be an issue; I've seen it on Linux with special characters. Here is an example in which only the second line contains a tab.
echo $'testing\ntesting\t' > file1
LC_ALL=C sort file1 > file1_sorted
rling -2 file1_sorted NUL
File "file1_sorted" is not in sorted order at line 2
Line 1: testing
Line 2: testing0x090x09
Not a full fix per se, but this fixes the tab character. Let me know if other characters cause issues as well and we can look at fixing those too.
int mystrcmp(const char *a, const char *b) {
    const unsigned char *s1 = (const unsigned char *) a;
    const unsigned char *s2 = (const unsigned char *) b;
    unsigned char c1, c2;
    do
    {
        /* Skip a single byte below 0x0A (e.g. tab, 0x09) on either side. */
        c1 = (unsigned char) *s1++;
        if (c1 < 10)
            c1 = (unsigned char) *s1++;
        c2 = (unsigned char) *s2++;
        if (c2 < 10)
            c2 = (unsigned char) *s2++;
        /* Lines are newline-terminated; stop at the end of the first line. */
        if (c1 == '\n')
            return c1 - c2;
    }
    while (c1 == c2);
    return c1 - c2;
}
This is a better fix, but the real issue is also that the sort function isn't sorting correctly.
echo $'testing\ntesting\x03\ntesting\x02' > file1
./rling file1 file1_rling
hexdump -c file1_rling
0000000 t e s t i n g \n t e s t i n g 003
0000010 \n t e s t i n g 002 \n
000001a
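One reading of the tab example earlier in the thread, offered as an assumption rather than a confirmed root cause: if the order check compares lines with the trailing newline included (as mystrcmp does), then '\n' (0x0A) takes part in the comparison, and any byte below 0x0A, such as a tab (0x09) or 0x02/0x03, sorts before the end of the line. LC_ALL=C sort compares the bytes of each line without the newline, so a line sorts before any longer line it is a prefix of. A minimal sketch contrasting the two; naive_cmp and line_cmp are hypothetical names, not rling's actual functions:

```c
/* Hypothetical: the terminating '\n' (0x0A) takes part in the comparison. */
static int naive_cmp(const char *a, const char *b)
{
    const unsigned char *s1 = (const unsigned char *)a;
    const unsigned char *s2 = (const unsigned char *)b;

    while (*s1 == *s2 && *s1 != '\n') {
        s1++;
        s2++;
    }
    return (int)*s1 - (int)*s2;
}

/* Hypothetical: '\n' is treated as end of line, so a line sorts before
 * any longer line that it is a prefix of. Both inputs must be
 * newline-terminated. */
static int line_cmp(const char *a, const char *b)
{
    const unsigned char *s1 = (const unsigned char *)a;
    const unsigned char *s2 = (const unsigned char *)b;

    while (*s1 != '\n' && *s2 != '\n' && *s1 == *s2) {
        s1++;
        s2++;
    }
    /* At least one line ended: the one that ended first sorts first. */
    if (*s1 == '\n' || *s2 == '\n')
        return (*s2 == '\n') - (*s1 == '\n');
    /* Otherwise the first differing byte decides. */
    return (int)*s1 - (int)*s2;
}
```

With these, naive_cmp("testing\n", "testing\t\n") is positive (0x0A - 0x09), which would match the "not in sorted order at line 2" report, while line_cmp returns a negative value for the same pair.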
I pushed a new fix a while ago but forgot to clarify that this PR should resolve this issue entirely.
The above PR does help in some situations but doesn't fix it entirely, so the issue is still very much open.
$ cat /dev/random | head -n 50000 > test.txt
$ LC_ALL=C sort test.txt -o test2.txt
$ ./rling -2 test2.txt test3.txt
Estimated memory required: 52,429,024 (50.00Mbytes)
Allocated in 0.0563 seconds
Start processing input "test2.txt"
File "test2.txt" is not in sorted order at line 205
Line 204: 0x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x00
Line 205: 0x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x00
It seems sort and rling disagree on how to deal with empty lines. Any ideas? (Tested on Windows, with sort run through Git Bash and WSL.)
Test file: https://anonfiles.com/ZfEcEfR5uf/testFile_txt
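For random data like the test above, lines can contain NUL bytes, so a check that wants to agree with LC_ALL=C sort probably needs explicit lengths rather than a terminator byte. A rough sketch of such a check, assuming the whole file is already in memory; span_cmp and first_unsorted_line are hypothetical helpers, not part of rling:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: byte-wise comparison of two lines given as
 * (pointer, length) pairs. NUL bytes are ordinary data, and a line
 * sorts before any longer line it is a prefix of. */
static int span_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int r = n ? memcmp(a, b, n) : 0;

    if (r != 0)
        return r;
    return (alen > blen) - (alen < blen);
}

/* Hypothetical: returns 0 if the lines in buf[0..len) are in
 * non-decreasing byte order, otherwise the 1-based number of the
 * first line that sorts before its predecessor. */
static size_t first_unsorted_line(const char *buf, size_t len)
{
    const char *prev = NULL;
    size_t prev_len = 0, lineno = 0, pos = 0;

    while (pos < len) {
        const char *cur = buf + pos;
        const char *nl = memchr(cur, '\n', len - pos);
        size_t cur_len = nl ? (size_t)(nl - cur) : len - pos;

        lineno++;
        if (prev && span_cmp(prev, prev_len, cur, cur_len) > 0)
            return lineno;
        prev = cur;
        prev_len = cur_len;
        pos += cur_len + (nl ? 1 : 0);   /* step over the newline, if any */
    }
    return 0;
}
```

Under this comparison, a line of 37 NUL bytes followed by a longer line of NUL bytes counts as sorted, which is the order sort produced for lines 204 and 205 in the output above.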