andreyto / mr-mpi-blast

Parallel implementation of NCBI BLAST+ with MapReduce-MPI
http://andreyto.github.com/mgtaxa/

Create facility for safely merging output from several blast runs #3

Open andreyto opened 12 years ago

andreyto commented 12 years ago

The goal is to be able to combine the output produced by indexing and BLASTing several non-intersecting FASTA files into a single dataset.

  1. Require that the IIDs for each run are generated with offsets so that they do not overlap.
  2. Implement a procedure that merges the defline index files into one file, checking the uniqueness of the IIDs along the way.
     1. Replace the current combination (text defline index file + in-memory dictionary that maps ID to the line in the file) with a file-based implementation that can handle 10^9 records. Test both anydbm and sqlite: take an existing text defline index file, convert it to anydbm or sqlite, read the IDs from the text file, randomize their order, iterate through all of them pulling the defline from the new DB, and time it. Replicate the text file N times with new IIDs and time access again. Use the implementation that is faster, with a preference for sqlite due to its better portability and generality.
     2. Write a script that merges the defline databases into one. If sqlite, use the "attach" command and "insert into ... select" (search the MGTAXA code for an example of using "attach"), and always create the index only after the output file has been written (create unique index ...).
  3. After the defline index merge has run successfully, the data conversion procedure runs on the bin files using the merged defline index.
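Step 2.2 might be sketched roughly as below. This is a hypothetical illustration only: the table and column names (`defline`, `iid`, `txt`) and the function name are assumptions, not taken from the actual mr-mpi-blast schema. It shows the pattern described above: bulk-copy with `ATTACH` + `INSERT INTO ... SELECT`, then create the unique index last.

```python
# Sketch of merging per-run defline sqlite databases into one output file.
# Schema (table "defline" with columns iid, txt) is an assumption.
import sqlite3

def merge_defline_dbs(out_path, in_paths):
    con = sqlite3.connect(out_path)
    con.execute("CREATE TABLE IF NOT EXISTS defline (iid INTEGER, txt TEXT)")
    for i, path in enumerate(in_paths):
        alias = "src%d" % i
        con.execute("ATTACH DATABASE ? AS %s" % alias, (path,))
        con.execute("INSERT INTO defline SELECT iid, txt FROM %s.defline" % alias)
        con.commit()  # DETACH fails inside an open transaction
        con.execute("DETACH DATABASE %s" % alias)
    # Index creation deferred until after the bulk load; UNIQUE also verifies
    # that the IIDs from the different runs do not overlap (step 1).
    con.execute("CREATE UNIQUE INDEX idx_defline_iid ON defline (iid)")
    con.commit()
    con.close()
```

Creating the index only after all rows are inserted avoids paying the index-maintenance cost on every insert, and the `UNIQUE` constraint doubles as the IID-uniqueness check required by step 2.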
andreyto commented 12 years ago

Upon testing sqlite and Python's anydbm, it is apparent that neither scales beyond a few million sequences. So we proceed with an alternative plan: sort the BLAST output across all rank files in the order of query IID (that is, in the original input order) while it is being generated. Merging with the deflines then becomes a simple sequential scan of both the defline file and the BLAST hits file.

On parallel sorting with MPI in general: parallel bubble sort (odd-even transposition), http://www.waset.org/journals/waset/v75/v75-59.pdf, looks like a good match for the MapReduce-MPI BLAST output because the data is already partitioned before sorting and output. The efficiency might drop, however, at such a large CPU count. We can use the MR-MPI gather to collect the data to a subset of the CPUs.
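Once both files are sorted by query IID, the merge is a single linear pass over the two streams. A minimal sketch, assuming each stream yields `(iid, payload)` pairs already sorted by IID (the record format is an illustration, not the actual mr-mpi-blast file layout):

```python
# Sketch of the sequential merge scan over IID-sorted defline and hits streams.
def merge_sorted_streams(deflines, hits):
    """deflines, hits: iterables of (iid, payload) pairs sorted by iid.
    Yields (iid, defline_payload, hit_payload) for every hit with a defline."""
    def_iter = iter(deflines)
    cur = next(def_iter, None)
    for iid, hit in hits:
        # Advance the defline stream until it catches up with this hit's IID.
        while cur is not None and cur[0] < iid:
            cur = next(def_iter, None)
        if cur is not None and cur[0] == iid:
            yield iid, cur[1], hit
```

Because both inputs are consumed strictly in order, this runs in O(n) time with O(1) memory, which is what makes the "sort while generating" plan attractive at 10^9-record scale.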

But because our ID/key is a sequential integer, we do not even need that; we can just use a hash sort. In the simplest implementation, the master gathers array[max_id] = item_length from the slaves (or this is done symmetrically on all nodes). Then use it to build array[max_id] = destination_rank, such that destination_rank is monotonic, and use that array as the hash function for collate() or aggregate() (see the man page: http://www.sandia.gov/~sjplimp/mapreduce/doc/aggregate.html).

Aggregate() is not a stable operation, so a local sort will have to be done after it. But it does not have to be done before the aggregate: only the local part of array[max_id] = item_length needs to be built first, which can be done by a sequential scan through the key-value (or key-multivalue) pairs at each rank, building an array for the locally owned query IDs, followed by an MPI reduce/broadcast (there is a single MPI function for this, MPI_Allreduce, I believe). That array can be accumulated within the map() that calls BLAST itself. So, the calls should look like this:
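The construction of the monotonic destination_rank array described above might look roughly like this. This is a sketch under assumed names (`build_destination_ranks`, a plain Python list for `item_length`); it does not show the MPI reduce or the MR-MPI aggregate() call itself, only the load-balancing hash table they would share:

```python
# Sketch: from the globally reduced item_length array (output size per query
# IID), build a monotonic destination_rank array that balances total output
# size across nranks. Indexing this array by IID is the "hash function"
# handed to MR-MPI's aggregate()/collate().
def build_destination_ranks(item_length, nranks):
    total = sum(item_length)
    target = total / float(nranks)  # ideal share of output per rank
    dest = []
    rank, acc = 0, 0
    for length in item_length:
        # Advance to the next rank once the current one has its fair share;
        # rank only ever increases, so the assignment is monotonic in IID,
        # which preserves the original input order across ranks.
        if acc >= target * (rank + 1) and rank < nranks - 1:
            rank += 1
        dest.append(rank)
        acc += length
    return dest
```

Because the assignment is monotonic in IID, after aggregate() each rank holds a contiguous IID range, and the required local sort restores the full original input order globally.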

sulsj commented 12 years ago

All done.