andreyto opened 12 years ago
Testing SQLite and Python's anydbm shows that neither scales beyond a few million sequences. So we proceed with an alternative plan: sort the BLAST output across all rank files in the order of query IID (in other words, in the original input order) while it is being generated. Merging with the deflines then becomes a simple sequential scan of both the defline file and the BLAST hits file (a sketch of this scan follows below).

For parallel sorting with MPI in general, parallel bubble sort (odd-even transposition) is one option: http://www.waset.org/journals/waset/v75/v75-59.pdf. It looks like a good match for MPI MapReduce BLAST output because the data is already partitioned before sorting and output. The efficiency might drop, however, at such a large CPU count; we can use the MR-MPI gather to collect the data onto a subset of CPUs.
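Returning to the merge step above, here is a minimal sketch of that sequential scan, joining the two files once both are sorted by query IID. The file names, tab-separated layouts, and the one-defline-per-IID assumption are illustrative only, not the project's actual formats:

```cpp
// Sketch of the merge step: one sequential pass over two files that are
// both sorted by query IID. File names and layouts are assumptions.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream deflines("deflines.sorted.tsv");  // iid <TAB> defline
    std::ifstream hits("blast_hits.sorted.tsv");    // iid <TAB> hit fields...
    std::string dline, hline, defline;
    long diid = -1, hiid;
    while (std::getline(hits, hline)) {
        std::istringstream hs(hline);
        hs >> hiid;
        // Advance the defline stream until it reaches this hit's IID.
        while (diid < hiid && std::getline(deflines, dline)) {
            std::istringstream ds(dline);
            ds >> diid;
            std::getline(ds >> std::ws, defline);
        }
        if (diid == hiid)
            std::cout << defline << '\t' << hline << '\n';
    }
    return 0;
}
```

Because both inputs arrive already sorted on the same key, this join is a single linear pass and never needs random access, which is exactly what made sqlite/anydbm unnecessary.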
But because our ID/key is a sequential integer, we do not even need that: we can just use a hash sort. In the simplest implementation, the master gathers array[max_id] = item_length from the slaves (or this is done symmetrically across all nodes). That array is then used to build array[max_id] = destination_rank such that destination_rank is monotonic, and the destination array serves as the hash function for collate() or aggregate() (see the man page: http://www.sandia.gov/~sjplimp/mapreduce/doc/aggregate.html). aggregate() is not a stable operation, so a local sort will have to be done after it. No sort is needed before the aggregate, however: routing only requires the length array, not any ordering. Only the local part of array[max_id] = item_length needs to be built first, which can be done by a sequential scan through the key-value (or key-multivalue) pairs at each rank, building an array for the locally owned query IDs, followed by an MPI reduce/broadcast (a single MPI function, MPI_Allreduce, does both). That array can even be accumulated within the map() that calls BLAST itself. So the calls should look like this:
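The original call sequence did not survive in this copy of the issue, so what follows is a hedged sketch of what it could look like against the MR-MPI C++ interface. The BLAST invocation inside the map callback, the int64 query-IID key layout, and the chunk bookkeeping are all illustrative assumptions; note that the hash callback must reach the destination table through file scope, since aggregate() passes no user pointer to it:

```cpp
// Sketch only: the BLAST call, key layout, and chunk handling are assumed.
#include <mpi.h>
#include <algorithm>
#include <cstring>
#include <vector>
#include "mapreduce.h"
#include "keyvalue.h"
using namespace MAPREDUCE_NS;

// query IID -> destination rank; file-scope because the hash callback
// handed to aggregate() receives no user pointer.
static std::vector<int> dest_rank;

struct State {
    std::vector<long> local_len;  // array[max_id] = item_length, local part
};

// map() callback: run BLAST on one input chunk, emit (query IID, hit record)
// pairs, and accumulate the per-IID output lengths as a side effect.
void blast_map(int itask, KeyValue *kv, void *ptr) {
    State *st = (State *) ptr;
    // For each hit that BLAST produces on chunk itask (pseudocode):
    //   long iid = ...; const char *rec = ...; int reclen = ...;
    //   kv->add((char *) &iid, sizeof(iid), (char *) rec, reclen);
    //   st->local_len[iid] += reclen;
}

// aggregate() hash: route each query IID to its precomputed rank.
int iid_hash(char *key, int keybytes) {
    long iid;
    memcpy(&iid, key, sizeof(iid));
    return dest_rank[iid];
}

// sort_keys() comparator: numeric order of query IIDs.
int iid_compare(char *k1, int len1, char *k2, int len2) {
    long a, b;
    memcpy(&a, k1, sizeof(a));
    memcpy(&b, k2, sizeof(b));
    return a < b ? -1 : (a > b ? 1 : 0);
}

void run(long max_id, int nchunks) {
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MapReduce *mr = new MapReduce(MPI_COMM_WORLD);
    State st;
    st.local_len.assign(max_id, 0);
    mr->map(nchunks, blast_map, &st);          // BLAST + length accounting

    // The single reduce+broadcast call mentioned above: MPI_Allreduce.
    std::vector<long> len(max_id);
    MPI_Allreduce(st.local_len.data(), len.data(), (int) max_id,
                  MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    // Build the monotonic IID -> rank table by walking the running total
    // of lengths and cutting it into roughly equal pieces, one per rank.
    long total = 0;
    for (long i = 0; i < max_id; i++) total += len[i];
    long per_rank = total / nprocs + 1, acc = 0;
    dest_rank.assign(max_id, 0);
    for (long i = 0; i < max_id; i++) {
        dest_rank[i] = (int) std::min<long>(acc / per_rank, nprocs - 1);
        acc += len[i];
    }

    mr->aggregate(iid_hash);                   // route each pair by dest_rank
    mr->sort_keys(iid_compare);                // aggregate() is not stable
    // The pairs now sit on the ranks in globally monotonic IID order and can
    // be streamed out sequentially, e.g. with mr->scan() or a final pass.
    delete mr;
}
```

Because dest_rank is non-decreasing in IID, the per-rank sort_keys() is enough to make the sequence globally sorted across ranks, which is what the sequential merge with the deflines relies on.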
All done.
The goal is to be able to combine the output produced by indexing and BLASTing several non-intersecting FASTA files into a single dataset.