Open solardiz opened 4 years ago
I have identified and replicated the issue. The core of the problem is that rling splits the file into large "chunks" and processes these on multiple cores at the same time. For example, in your test file, the word "svn7" appears at lines 62541, 43312836, 71731224, 71733302 and 71749022. Depending on the number of cores (threads) in use, the later occurrences of "svn7" may be processed before the "earlier" line numbers. The file itself is, of course, not being re-ordered; it is just that the duplicates that get dropped are not necessarily the later ones in the file. I was able to see this behaviour on several different systems, and in all cases the correct number of lines was output, all without duplication.
All of that said, "first in file wins" is what the principle of least astonishment implies, and there will be a change to the code to implement this (though I may offer a switch, as it is significantly faster to process the file as cores become available than to wait for a previous block to complete before starting the next run).
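The "first in file wins" idea can be sketched as follows. This is not rling's actual code, only a minimal single-process illustration: `dedupe_first_wins`, `chunk_size` and `seed` are hypothetical names, and shuffling the chunk list stands in for threads finishing in an unpredictable order. Recording the smallest line index seen for each line makes the result independent of the order in which chunks are processed.

```python
import random


def dedupe_first_wins(lines, chunk_size=3, seed=None):
    """Deduplicate `lines`, always keeping the earliest occurrence.

    Chunks are visited in shuffled order to mimic worker threads that
    complete at different times (an assumption for illustration only).
    """
    chunks = [
        (start, lines[start:start + chunk_size])
        for start in range(0, len(lines), chunk_size)
    ]
    random.Random(seed).shuffle(chunks)  # arbitrary completion order

    first_seen = {}  # line text -> smallest line index seen so far
    for start, chunk in chunks:
        for offset, text in enumerate(chunk):
            index = start + offset
            if text not in first_seen or index < first_seen[text]:
                first_seen[text] = index

    # Emit the surviving lines in original file order.
    return [text for text, _ in sorted(first_seen.items(), key=lambda kv: kv[1])]


words = ["a", "svn7", "b", "svn7", "c", "a", "svn7"]
print(dedupe_first_wins(words, seed=1))
```

Whatever value `seed` takes, the surviving "svn7" is the one at index 1. The cost is the extra index comparison and the final sort, which is one way to see why an order-preserving mode might be slower than dropping whichever duplicate is encountered second.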
Input file generation:
`all.lst` is from https://download.openwall.net/pub/wordlists/all.gz (MD5 f7b3b76d15bbb95fcb267ea6be108cce), `john` is current bleeding-jumbo with its default `john.conf`. The resulting `all.lst-rules-with-dupes` is 173188126 lines, 2037345891 bytes (MD5 4c221f4df353aae89bdcd6888e92887a).
These commands produce the same unique lines, but in different order:
`t1` is the same as what JtR's `unique` program produces, `t2` isn't.
Edit: more detail:
`t2` changes between command invocations. This is on Scientific Linux 6.10 (so old glibc, and I had to add `-lrt` for `clock_gettime` to be found). I tried with two gcc versions (system default gcc 4.4.7 and devtoolset-8 gcc 8.2.1) - same behavior.
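The observation that two outputs hold the same unique lines in a different order can be checked with a few lines of Python. This is only a sketch: `same_lines_different_order` is a hypothetical helper, and it takes in-memory lists, so for outputs as large as the ones described here an external sort followed by a comparison would likely be more practical.

```python
def same_lines_different_order(a, b):
    """Return True if `a` and `b` contain the same lines but in a different order."""
    return sorted(a) == sorted(b) and a != b


# Hypothetical miniature stand-ins for the t1 and t2 outputs.
t1 = ["alpha", "svn7", "beta"]
t2 = ["alpha", "beta", "svn7"]
print(same_lines_different_order(t1, t2))
```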