... but no matter what I do, rling -q cw's topX starts here:
Count Line
2285882 site
1977737 store
1616455 app
1592094 biz
1126259 vip
1017868 cfd
A potentially related mismatch can be reproduced with a file in which every line is the same string:
$ yes | head -n 4M >yes4m.list
$ wc -l yes4m.list
4194304 yes4m.list
$ rling -q cw yes4m.list stdout
Reading "yes4m.list"...8388608 bytes total in 0.0064 seconds
Counting lines...Found 4194304 lines in 0.1082 seconds
Estimated memory required: 213,909,536 (204.00Mbytes)
Sorting... took 0.0000 seconds
Frequency: 1 unique (4194303 duplicate lines) in 0.3733 seconds
0 total lines matched in 0.3733 seconds
Input file had 4,194,304 lines, with lengths from 1 to 1
Writing analysis to "stdout"
Count Line
Wrote 1 lines in 0.0000 seconds
Total runtime 0.4880 seconds
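For comparison, the expected counts can be computed independently of rling with standard tools. This is only a minimal sketch, assuming a POSIX shell with `sort`, `uniq`, and `awk`; the function name `expected_counts` is made up for illustration and is not part of rling.

```shell
# Hypothetical helper (not part of rling): print "count line" pairs,
# most frequent first, for the file given as $1.
expected_counts() {
    sort "$1" | uniq -c | sort -k1,1nr |
        awk '{ c = $1; sub(/^ *[0-9]+ /, ""); print c, $0 }'
}

# Usage: expected_counts yes4m.list
```

For the file above, where every line is `y`, this should print a single row with the full line count, which is the row missing from the rling output.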
There also seems to be a threshold at 100,000:
$ rm test.dat; yes 'ww' | head -n 100000 >test.dat; yes 'ff' | head -n 100000 >>test.dat; rling -q cw test.dat stdout 2>/dev/null
Count Line
$ rm test.dat; yes 'ww' | head -n 10000 >test.dat; yes 'ff' | head -n 10000 >>test.dat; rling -q cw test.dat stdout 2>/dev/null
Count Line
10000 ff
$ count=99999; rm test.dat; yes 'ww' | head -n ${count} >test.dat; yes 'ff' | head -n ${count} >>test.dat; rling -q cw test.dat stdout 2>/dev/null
Count Line
99999 ff
$ count=100000; rm test.dat; yes 'ww' | head -n ${count} >test.dat; yes 'ff' | head -n ${count} >>test.dat; rling -q cw test.dat stdout 2>/dev/null
Count Line
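The manual runs above could be wrapped in a small sweep to narrow down where rows start to go missing. This is a rough sketch rather than a definitive test: `make_pair_file` and `sweep` are hypothetical names, the `rling -q cw test.dat stdout` invocation is copied from the commands above, and it assumes the output always begins with the single `Count Line` header line seen in these transcripts.

```shell
# Hypothetical helper: write N lines of 'ww' followed by N lines of 'ff'.
make_pair_file() {
    n=$1
    out=$2
    awk -v n="$n" 'BEGIN {
        for (i = 0; i < n; i++) print "ww"
        for (i = 0; i < n; i++) print "ff"
    }' > "$out"
}

# For each count given, run rling and report how many data rows it
# printed after the "Count Line" header. Two rows (ww and ff) are
# expected for every count.
sweep() {
    for count in "$@"; do
        make_pair_file "$count" test.dat
        rows=$(rling -q cw test.dat stdout 2>/dev/null | tail -n +2 | wc -l)
        echo "$count $((rows))"
    done
}

# Usage: sweep 10000 99999 100000
```

Any count for which the second column is not 2 points at a case where a row was dropped.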