Cynosureprime / rling

RLI Next Gen (Rling), a faster multi-threaded, feature rich alternative to rli found in hashcat utilities.
MIT License

sort/rling sorting disagreement #30

Open PenguinKeeper7 opened 2 years ago

PenguinKeeper7 commented 2 years ago

It seems sort and rling disagree on how to order empty lines. Any ideas? (Tested on Windows, with sort run through Git Bash and WSL.)

$ LC_ALL=C sort testFile.txt > testFile2.txt

./rling -2 "testFile2.txt" NUL
...
File "testFile2.txt" is not in sorted order at line 2
Line 1:
Line 2: 0x020x02☻

Test file: https://anonfiles.com/ZfEcEfR5uf/testFile_txt
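
For reference, `LC_ALL=C sort` is expected to order lines by raw byte value, with a line that is a prefix of another sorting first. The following is a minimal sketch of that ordering, not rling's code; the name `byte_order_cmp` is hypothetical, and it assumes line lengths are known and exclude the terminator.

```c
#include <stddef.h>
#include <string.h>

/* Compare two lines by raw byte value, given explicit lengths
   (line terminators excluded). Returns <0, 0 or >0, like memcmp. */
static int byte_order_cmp(const void *a, size_t alen,
                          const void *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int r = memcmp(a, b, n);
    if (r != 0)
        return r;
    /* The common prefix is equal: the shorter line sorts first. */
    return (alen > blen) - (alen < blen);
}

/* Usage (hypothetical): the empty line against a line of two 0x02 bytes.
     byte_order_cmp("", 0, "\x02\x02", 2)   is negative,
   so under this ordering the empty line belongs first, as sort placed it. */
```

Under this ordering the file produced by sort above is already in order, so the report from `rling -2` looks like a disagreement in the comparison itself.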

0xVavaldi commented 2 years ago

https://github.com/Cynosureprime/rling/issues/25#issuecomment-887741423

roycewilliams commented 2 years ago

@PenguinKeeper7's example shows that he's setting LC_ALL=C for his run of sort.

roycewilliams commented 2 years ago

@hops could I impose upon you to look at this one briefly, as time allows? I'm not clear about what the root cause is.

flaggx1 commented 2 years ago

This does appear to be an issue; I've seen it on Linux with special characters. Here is an example where only the second line contains a tab.

echo $'testing\ntesting\t' > file1
LC_ALL=C sort file1 > file1_sorted
rling -2 file1_sorted NUL

File "file1_sorted" is not in sorted order at line 2
Line 1: testing
Line 2: testing0x090x09

0xVavaldi commented 2 years ago

Not a full fix per se, but this fixes the tab character. Let me know if other characters cause issues as well and we can look at fixing those too.

0xVavaldi commented 2 years ago

int mystrcmp(const char *a, const char *b) {
    const unsigned char *s1 = (const unsigned char *) a;
    const unsigned char *s2 = (const unsigned char *) b;
    unsigned char c1, c2;

    /* '\n' is treated as the line terminator. */
    do {
        c1 = *s1++;
        if (c1 < 10)        /* skip one byte below '\n' (0x0A) */
            c1 = *s1++;
        c2 = *s2++;
        if (c2 < 10)
            c2 = *s2++;
        if (c1 == '\n')     /* end of the first line */
            return c1 - c2;
    } while (c1 == c2);
    return c1 - c2;
}

This is a better fix, but the real issue is also that the sort function isn't sorting correctly.

echo $'testing\ntesting\x03\ntesting\x02' > file1
./rling file1 file1_rling
hexdump -c file1_rling

0000000   t   e   s   t   i   n   g  \n   t   e   s   t   i   n   g 003
0000010  \n   t   e   s   t   i   n   g 002  \n
000001a

0xVavaldi commented 2 years ago

I pushed a new fix a while ago but forgot to clarify that this PR should resolve this issue entirely.

PenguinKeeper7 commented 1 year ago

The above PR helps in some situations but doesn't fix the problem entirely, so the issue is still very much open.

$ cat /dev/random | head -n 50000 > test.txt
$ LC_ALL=C sort test.txt -o test2.txt
$ ./rling -2 test2.txt test3.txt
Estimated memory required: 52,429,024 (50.00Mbytes)
Allocated in 0.0563 seconds
Start processing input "test2.txt"
File "test2.txt" is not in sorted order at line 205
Line 204: 0x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x00
Line 205: 0x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x00
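
All three failing cases in this thread (0x02, tab, 0x00) involve bytes below 0x0A, the value of '\n'. That is consistent with the line terminator being compared as an ordinary byte, though this is an inference from the outputs above, not a confirmed reading of rling's source. The sketch below shows one possible alternative to skipping low bytes: treat '\n' purely as end of line and let end of line sort before every byte value. The names `line_cmp` and `first_unsorted_line` are hypothetical and not part of rling.

```c
#include <stddef.h>

/* Compare two newline-terminated lines inside a buffer that ends at `end`.
   A '\n' (or the end of the buffer) marks end of line and sorts before
   every byte value, including 0x00. Returns <0, 0 or >0. */
static int line_cmp(const unsigned char *a, const unsigned char *b,
                    const unsigned char *end)
{
    for (;;) {
        int a_eol = (a == end || *a == '\n');
        int b_eol = (b == end || *b == '\n');
        if (a_eol || b_eol)
            return b_eol - a_eol;   /* the line that ends first sorts first */
        if (*a != *b)
            return (int)*a - (int)*b;
        a++;
        b++;
    }
}

/* Return 0 if every line in buf[0..len) is >= the one before it,
   otherwise the 1-based number of the first out-of-order line. */
static size_t first_unsorted_line(const unsigned char *buf, size_t len)
{
    const unsigned char *end = buf + len;
    const unsigned char *prev = buf;
    size_t lineno = 1;

    while (prev < end) {
        const unsigned char *cur = prev;
        while (cur < end && *cur != '\n')
            cur++;                  /* find the end of the previous line */
        if (cur == end)
            break;                  /* last line had no trailing newline */
        cur++;                      /* step over '\n' to the next line */
        if (cur >= end)
            break;                  /* no further line to compare */
        lineno++;
        if (line_cmp(prev, cur, end) > 0)
            return lineno;
        prev = cur;
    }
    return 0;
}
```

With this comparator, a short run of 0x00 bytes followed by a longer run is accepted as sorted, which matches what `LC_ALL=C sort` produced for lines 204 and 205 above. The buffer length is passed explicitly because the data contains embedded NUL bytes, so NUL-terminated string functions cannot be used.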