Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Make fast_featurecounter recursive #63

Open organisciak opened 9 years ago

organisciak commented 9 years ago

The way fast_featurecounter.sh works is:

  1. Reads "id \t token \t count" input files in blocks of size X (I use 300M), keeping only the "token \t count" fields. Each of these blocks is sorted, then consecutive lines with the same token are merged. The blocks are saved as temp files (now about 300 of them) and a file list is output.
  2. The list of sorted temp files is read, and they are merged in batches of 30 files using sort -m (merge sort). Again, consecutive lines with the same token are folded (e.g. "cat \t 30 \n cat \t 20" becomes "cat \t 50"), the results are saved as temporary files, and these are piped to the next process (see the sketch after this list).
  3. The final temp files are sorted and merged into one big wordlist file.
  4. The wordlist file is sorted by the frequency column and numbered.
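
For reference, the merge-and-fold in step 2 looks roughly like this (a sketch, not the literal fast_featurecounter.sh code; the batch directory, file names, and awk fold are illustrative):

    # Merge a batch of ~30 already-sorted "token \t count" temp files and fold
    # consecutive lines that share a token into one summed line.
    sort -m -t$'\t' -k1,1 batch_of_30/*.txt \
      | awk -F'\t' '
          NR > 1 && $1 != prev { print prev "\t" sum; sum = 0 }
          { prev = $1; sum += $2 }
          END { if (NR) print prev "\t" sum }
        ' > merged_batch_001.txt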

The concern is that step two's merge-sort-sum pattern may need to run more than once; otherwise the final merge in step three may be too large. Without a recursive step 2, a 3TB input would leave 334 files to merge in step three. Ideally we want:

(pseudocode)

STEP1 > listOfTmpFiles.txt
while true; do
    STEP2 < listOfTmpFiles.txt > newList.txt
    if [ "$(wc -l < newList.txt)" -lt 100 ]; then
        break
    fi
    mv newList.txt listOfTmpFiles.txt
done

I've been doing this manually for the HTRC dataset, because I worry about Bash debugging time before Uncamp.

bmschmidt commented 9 years ago

How do you know the optimal number of files for sort -m? This paper (http://sortbenchmark.org/tokupenny.pdf) seems to indicate that sort -m may internally already be doing a hierarchical merge of the sort you suggest. (In the second clause of the second sentence--it implies the opposite in the first clause, so I don't know.)

In contrast, Gnu Sort [1] creates a buffer for each open file, and then repeatedly scans all the buffers to find the smallest key, and then outputs that line. Thus Gnu Sort implements a priority queue where each operation takes time O(N) for each line, and so Gnu Sort constructs a merge tree in which 16 files are merged into a bigger file, and then 16 of those larger files are merged into a bigger file, and so forth. Hence the merge phase of a terabyte sort would require two to three passes over the data. Because TokuMergeSort uses an efficient priority queue, it can merge many hundreds of files in a single pass.

Edit: but that's without the counting step to reduce duplicate records, I see. Still, I wonder what the tradeoffs are.
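
If it helps, GNU sort exposes the merge fan-in directly through --batch-size, so a wide single-pass merge could be timed against the default behaviour. A sketch, assuming GNU coreutils sort; the fan-in of 300, the temp dir, and the file list below are illustrative guesses rather than tested settings:

    # Merge every sorted temp file in one sort -m invocation; --batch-size raises
    # the fan-in above the 16-way merging the paper describes, and -T keeps any
    # intermediate merge files on a chosen volume.
    sort -m -t$'\t' -k1,1 --batch-size=300 -T /bigdisk/tmp \
         $(< listOfTmpFiles.txt) > merged.txt
    # The duplicate-token summing from step 2 still has to run over merged.txt.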

organisciak commented 9 years ago

Thanks! I wasn't sure about the performance of sort -m at huge scales; it's generally very robust. Oddly, when I tried a large merge it gave a null output: the reason for the bug was unclear, and I haven't been able to recreate it with smaller data. The other reason for my concern is disk space, although sort seems to use the tmp dir specified for GNU Parallel, so that can be managed.
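
On the disk-space point, the scratch locations can at least be pinned explicitly. A sketch; /scratch/tmp is a placeholder path, not anything in the current script:

    # Keep every tool's temp files on one large volume; /scratch/tmp is a placeholder.
    export TMPDIR=/scratch/tmp     # GNU sort falls back to $TMPDIR when -T is not given
    df -h /scratch/tmp             # sanity-check free space before the big merge
    # GNU Parallel writes its own temp files as well; its --tmpdir option points
    # them at the same place (parallel --tmpdir /scratch/tmp).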
