Cynosureprime / rling

RLI Next Gen (Rling), a faster, multi-threaded, feature-rich alternative to rli found in hashcat utilities.
MIT License

rling -q cw drops some records during counting #47

Open roycewilliams opened 1 month ago

roycewilliams commented 1 month ago

Example: here is the top-X frequency count of TLDs from a domain dump, produced with a Perl script:

198174606:com
16123338:net
13445547:org
8481074:top
5972916:xyz
4899113:info
4250349:online
3173095:shop
2285882:site
1977737:store
1616455:app
1592094:biz
1142129:icu
1126259:vip

... but no matter what I do, the top-X output of rling -q cw starts here:

  Count Line
 2285882 site
 1977737 store
 1616455 app
 1592094 biz
 1126259 vip
 1017868 cfd 

A potentially related mismatch can be reproduced with a file that contains the same string on every line:

$ yes | head -n 4M >yes4m.list
$ wc -l yes4m.list 
4194304 yes4m.list
$ rling -q cw yes4m.list stdout
Reading "yes4m.list"...8388608 bytes total in 0.0064 seconds
Counting lines...Found 4194304 lines in 0.1082 seconds
Estimated memory required: 213,909,536 (204.00Mbytes)
Sorting... took 0.0000 seconds
Frequency:  1 unique (4194303 duplicate lines) in 0.3733 seconds

0 total lines matched in 0.3733 seconds
Input file had 4,194,304 lines, with lengths from 1 to 1
Writing analysis to "stdout"
   Count Line

Wrote 1 lines in 0.0000 seconds
Total runtime 0.4880 seconds
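
One quick sanity check for output like this is that the Count column should add up to the number of input lines. The following is a minimal sketch, assuming the table format shown above (a "Count Line" header followed by count/line rows); `sum_counts` is a hypothetical helper name, not part of rling.

```shell
# sum_counts: hypothetical helper, not part of rling.
# Reads a "Count Line" table on stdin and prints the sum of the counts.
sum_counts() {
    # Only rows whose first field is purely numeric are added,
    # so the header row and blank lines are skipped.
    awk '$1 ~ /^[0-9]+$/ { total += $1 } END { print total + 0 }'
}

# The table from the run above has a header and no rows, so the sum is 0.
printf '   Count Line\n\n' | sum_counts
```

For the run above this gives 0, against the 4194304 lines that wc -l reports for the input.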

There also seems to be a threshold at 100,000: a line that occurs 100,000 or more times is dropped from the output.

$ rm test.dat; yes 'ww' | head -n 100000 >test.dat; yes 'ff' | head -n 100000 >>test.dat; rling -q cw test.dat stdout 2>/dev/null
   Count Line

$ rm test.dat; yes 'ww' | head -n 10000 >test.dat; yes 'ff' | head -n 10000 >>test.dat; rling -q cw test.dat stdout 2>/dev/null
   Count Line
   10000 ff

$ count=99999; rm test.dat; yes 'ww' | head -n ${count} >test.dat; yes 'ff' | head -n ${count} >>test.dat; rling -q cw test.dat stdout 2>/dev/null
   Count Line
   99999 ff

$ count=100000; rm test.dat; yes 'ww' | head -n ${count} >test.dat; yes 'ff' | head -n ${count} >>test.dat; rling -q cw test.dat stdout 2>/dev/null
   Count Line
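
For comparison, a reference count can be produced with standard tools only. This is a sketch, not the Perl script mentioned above: it assumes POSIX sort, uniq, and awk, and `freqcount` is a hypothetical helper name. It prints `count:line` in descending order of count, the same shape as the TLD list at the top.

```shell
# freqcount: hypothetical reference counter built from sort, uniq and awk.
# Prints "count:line" for each distinct line, highest count first.
freqcount() {
    LC_ALL=C sort -- "$1" \
        | uniq -c \
        | awk '{ c = $1; sub(/^ *[0-9]+ /, ""); print c ":" $0 }' \
        | LC_ALL=C sort -t: -k1,1nr -k2
}

# Small self-contained demo input.
sample=$(mktemp)
printf 'ww\nff\nww\ncom\nww\nff\n' >"$sample"
freqcount "$sample"
rm -f "$sample"
```

Running `freqcount test.dat` on any of the test files above gives a baseline to compare against the rling -q cw output; every distinct line should appear in both.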