If rling worked anything like a typical program, this might be possible. It doesn't. The "read" and "parse lines" stages are completely decoupled, because lines can be parsed in parallel, using all of the cores in a given system. Reading, at least on current machines, can consume much of the memory bandwidth, so buffered reading proved to be the best option in testing.
What would be good is a simple de-duplicator program that could be used as part of a pipeline (as you rightly point out). Try writing a program that does basic de-duplication on a line basis, using just stdin/stdout. Permitting unlimited-length lines, like rling does, would be a bonus. This could be written in C, Go, or any language you are comfortable with. Even the Perl code may be sufficiently fast.
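For illustration, a minimal sketch of such a stdin/stdout de-duplicator in Go might look like the following (the structure is just one possible approach, not anything rling itself does). It uses bufio.Reader rather than bufio.Scanner so there is no fixed limit on line length, and it keeps each distinct line in a hash set, so memory grows with the number of unique lines:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	seen := make(map[string]struct{})
	in := bufio.NewReader(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	for {
		// ReadString imposes no line-length limit, unlike the
		// default bufio.Scanner buffer.
		line, err := in.ReadString('\n')
		if len(line) > 0 {
			key := line
			if key[len(key)-1] == '\n' {
				key = key[:len(key)-1] // normalize the trailing newline
			}
			if _, dup := seen[key]; !dup {
				seen[key] = struct{}{}
				out.WriteString(key)
				out.WriteByte('\n')
			}
		}
		if err == io.EOF {
			return
		}
		if err != nil {
			out.Flush()
			fmt.Fprintln(os.Stderr, "read error:", err)
			os.Exit(1)
		}
	}
}
```

Unlike sort -u, this preserves first-seen order and streams its output, but the hash set means memory is proportional to the number of unique lines - which is exactly the pressure the chunking idea below tries to relieve.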
When the data is large and also highly redundant, sending all records to a single invocation of rling can consume a very large amount of memory. In my testing, deduplicating "chunks" of the data - say, 500,000 records at a time - before passing the result to the final full rling pass can dramatically reduce memory usage in exchange for slower processing - and for highly duplicated data, the chunked approach sometimes seems to take less total time than a single direct rling pass, once sorting, etc. are accounted for.
Doing this "pre-deduplication" externally with Perl, or
parallel
calls to rling itself, works OK - but there may be significant efficiency benefits with native / direct support (in the form of an optional flag).
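As a sketch of the chunked pre-deduplication idea (not rling's actual implementation; chunkSize here just mirrors the 500,000-record figure above), the same kind of Go filter can reset its hash set every chunkSize input records. Peak memory then stays bounded by the chunk size, at the cost of letting cross-chunk duplicates through to the final global pass:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

const chunkSize = 500000 // records per chunk, mirroring the figure above

func main() {
	seen := make(map[string]struct{}, chunkSize)
	in := bufio.NewReader(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	count := 0
	for {
		line, err := in.ReadString('\n')
		if len(line) > 0 {
			key := line
			if key[len(key)-1] == '\n' {
				key = key[:len(key)-1]
			}
			if _, dup := seen[key]; !dup {
				seen[key] = struct{}{}
				out.WriteString(key)
				out.WriteByte('\n')
			}
			count++
			// Reset the set after every chunk so memory stays bounded;
			// duplicates that straddle chunks survive and are removed
			// by the final global pass (e.g. rling).
			if count == chunkSize {
				seen = make(map[string]struct{}, chunkSize)
				count = 0
			}
		}
		if err == io.EOF {
			return
		}
		if err != nil {
			out.Flush()
			fmt.Fprintln(os.Stderr, "read error:", err)
			os.Exit(1)
		}
	}
}
```

The trade-off matches the observation above: per-chunk state makes peak memory proportional to the chunk size rather than to the total number of unique lines, while the final global pass sees a much smaller stream containing at most one copy per chunk of each repeated line.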