organisciak opened this issue 9 years ago
How do you know the optimal number of files for sort -m? This paper (http://sortbenchmark.org/tokupenny.pdf) seems to indicate that sort -m may internally already be doing a hierarchical merge of the sort you suggest. (In the second clause of the second sentence--it implies the opposite in the first clause, so I don't know.)
In contrast, Gnu Sort [1] creates a buffer for each open file, and then repeatedly scans all the buffers to find the smallest key, and then outputs that line. Thus Gnu Sort implements a priority queue where each operation takes time O(N) for each line, and so Gnu Sort constructs a merge tree in which 16 files are merged into a bigger file, and then 16 of those larger files are merged into a bigger file, and so forth. Hence the merge phase of a terabyte sort would require two to three passes over the data. Because TokuMergeSort uses an efficient priority queue, it can merge many hundreds of files in a single pass.
Edit: but that's without the counting step to reduce duplicate records, I see. Still, I wonder what the tradeoffs are.
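For reference, the 16-way merge tree described in the quote is easy to reproduce by hand with plain sort -m batches. This is purely an illustration -- the 40 input files, their contents, and the batch size of 16 are made up:

```shell
#!/bin/sh
# Illustration of a two-pass merge tree built from sort -m.
# All file counts and names here are invented for the example.
set -e
tmp=$(mktemp -d)

# 40 small pre-sorted inputs of 6 lines each
for i in $(seq 1 40); do
  seq "$i" $((i + 5)) | sort -n > "$tmp/part.$i"
done

# first pass: merge the 40 inputs 16 at a time
n=0
ls "$tmp"/part.* | xargs -n 16 | while read -r batch; do
  n=$((n + 1))
  sort -m -n $batch > "$tmp/merged.$n"
done

# second pass: one final merge of the (at most 16) intermediates
sort -m -n "$tmp"/merged.* > "$tmp/final"
lines=$(wc -l < "$tmp/final" | tr -d ' ')
echo "$lines"
rm -rf "$tmp"
```

Each input is read and written once per pass, so a deep tree means more passes over the data -- which is the paper's point about GNU sort's 16-way fan-in versus a single large-fan-in priority-queue merge.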
Thanks! I wasn't sure about the performance of sort -m at huge scales; it's generally very robust. Oddly, when I tried a large merge it gave null output; the reason for the bug was unclear, but I haven't been able to recreate it with smaller data. The other reason for my concern is disk space, although sort seems to use the tmp dir specified for Gnu Parallel, so that can be managed.
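The tmp-dir behavior can be pinned down explicitly rather than inherited: GNU sort takes -T for where spill files go and --batch-size for how many inputs one merge pass holds open. A toy invocation (the three input files are made up):

```shell
#!/bin/sh
# Toy demo of the two GNU sort knobs relevant here:
# -T sends temporary spill files to a directory we control, and
# --batch-size caps how many files one merge pass has open at once.
set -e
tmp=$(mktemp -d)
for i in 1 2 3; do
  printf 'a%s\nb%s\n' "$i" "$i" | sort > "$tmp/in.$i"
done
sort -m -T "$tmp" --batch-size=16 "$tmp"/in.* > "$tmp/out"
merged=$(wc -l < "$tmp/out" | tr -d ' ')
echo "$merged"
rm -rf "$tmp"
```

Note --batch-size is GNU coreutils only, so this won't run on a stock BSD/macOS sort.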
On Sat, Mar 21, 2015, 2:28 PM Benjamin Schmidt notifications@github.com wrote:
The way fast_featurecounter.sh works is:
The concern is that step two's merge-sort-sum pattern may need to run more than once; otherwise the final merge in step three may be too large. Without a recursive step two, a 3TB input would leave 334 files for step three to merge. Ideally we want:
(pseudocode)
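A rough sketch of that recursive merge-sort-sum -- my own illustration, not the actual fast_featurecounter.sh. It assumes the intermediate files are sorted "word count" lines named counts.*, and the batch size is a guess at a safe fan-in for one sort -m call:

```shell
#!/bin/sh
# Sketch: repeatedly merge sorted count files in batches, summing
# counts for duplicate words, until few enough remain for one final
# merge. File naming (counts.*) and batch size are assumptions.
set -e

mergesum () {
  # merge pre-sorted "word count" files, summing duplicate words
  sort -m "$@" | awk '{s[$1] += $2} END {for (w in s) print w, s[w]}' | sort
}

recursive_mergesum () {  # usage: recursive_mergesum DIR BATCH
  dir=$1; batch=$2; pass=0
  while [ "$(ls "$dir"/counts.* | wc -l)" -gt "$batch" ]; do
    pass=$((pass + 1))
    # snapshot the file list, then merge-sum it BATCH files at a time
    ls "$dir"/counts.* | xargs -n "$batch" | nl | while read -r i files; do
      mergesum $files > "$dir/counts.p$pass.$i"
      rm $files
    done
  done
  mergesum "$dir"/counts.*   # final single-pass merge
}

# toy demo: five tiny count files, batch size 2 to force recursion
tmp=$(mktemp -d)
printf 'a 1\nb 2\n' > "$tmp/counts.1"
printf 'a 3\nc 4\n' > "$tmp/counts.2"
printf 'b 5\n'      > "$tmp/counts.3"
printf 'c 6\n'      > "$tmp/counts.4"
printf 'a 7\n'      > "$tmp/counts.5"
result=$(recursive_mergesum "$tmp" 2)
echo "$result"
rm -rf "$tmp"
```

Because each pass both merges and sums, duplicate keys collapse early, so later passes shrink faster than a plain sort -m tree would.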
I've been doing this manually for the HTRC dataset, because I worry about Bash debugging time before Uncamp.