If rling worked anything like a typical program, this might be possible. It doesn't. The "read" and "parse lines" stages are completely decoupled, because lines can be parsed in parallel, using all of the cores in a given system. Reading, at least on current machines, can consume much of the memory bandwidth, so buffered reading proved to be the best option in testing.
What would be good is a simple de-duplicator program that could be used as part of a pipeline (as you rightly point out). Try writing a program that does basic de-duplication on a line basis, using just stdin/stdout. Permitting unlimited-length lines, like rling does, would be a bonus. This could be written in C, Go, or any language you are comfortable with. Even the Perl code may be sufficiently fast.
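For illustration, a minimal sketch of such a stdin/stdout de-duplicator in Go might look like the following (the structure is just one possible approach, not anything rling itself does). It uses bufio.Reader rather than bufio.Scanner so there is no fixed limit on line length, and it keeps each distinct line in a hash set, so memory grows with the number of unique lines:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	seen := make(map[string]struct{})
	in := bufio.NewReader(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	for {
		// ReadString imposes no line-length limit, unlike the
		// default bufio.Scanner buffer.
		line, err := in.ReadString('\n')
		if len(line) > 0 {
			key := line
			if key[len(key)-1] == '\n' {
				key = key[:len(key)-1] // normalize the trailing newline
			}
			if _, dup := seen[key]; !dup {
				seen[key] = struct{}{}
				out.WriteString(key)
				out.WriteByte('\n')
			}
		}
		if err == io.EOF {
			return
		}
		if err != nil {
			out.Flush()
			fmt.Fprintln(os.Stderr, "read error:", err)
			os.Exit(1)
		}
	}
}
```

Unlike sort -u, this preserves first-seen order and streams its output, but the hash set means memory is proportional to the number of unique lines - which is exactly the pressure the chunking idea below tries to relieve.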
When the data is large and also highly redundant, sending all records to a single invocation of rling can consume a very large amount of memory. In my testing, deduplicating "chunks" of the data - say, 500,000 records at a time - before passing the result to the final full rling pass can dramatically reduce memory usage in exchange for slower processing - and for highly duplicated data, the chunked approach sometimes seems to take less total time than a single direct rling pass, once sorting, etc. are accounted for.
Doing this "pre-deduplication" externally with Perl, or
parallel
calls to rling itself, works OK - but there may be significant efficiency benefits with native / direct support (in the form of an optional flag).
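As a sketch of the chunked pre-deduplication idea (not rling's actual implementation; chunkSize here just mirrors the 500,000-record figure above), the same kind of Go filter can reset its hash set every chunkSize input records. Peak memory then stays bounded by the chunk size, at the cost of letting cross-chunk duplicates through to the final global pass:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

const chunkSize = 500000 // records per chunk, mirroring the figure above

func main() {
	seen := make(map[string]struct{}, chunkSize)
	in := bufio.NewReader(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	count := 0
	for {
		line, err := in.ReadString('\n')
		if len(line) > 0 {
			key := line
			if key[len(key)-1] == '\n' {
				key = key[:len(key)-1]
			}
			if _, dup := seen[key]; !dup {
				seen[key] = struct{}{}
				out.WriteString(key)
				out.WriteByte('\n')
			}
			count++
			// Reset the set after every chunk so memory stays bounded;
			// duplicates that straddle chunks survive and are removed
			// by the final global pass (e.g. rling).
			if count == chunkSize {
				seen = make(map[string]struct{}, chunkSize)
				count = 0
			}
		}
		if err == io.EOF {
			return
		}
		if err != nil {
			out.Flush()
			fmt.Fprintln(os.Stderr, "read error:", err)
			os.Exit(1)
		}
	}
}
```

The trade-off matches the observation above: per-chunk state makes peak memory proportional to the chunk size rather than to the total number of unique lines, while the final global pass sees a much smaller stream containing at most one copy per chunk of each repeated line.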