cbg-ethz / shorah

Repo for the software suite ShoRAH (Short Reads Assembly into Haplotypes)
GNU General Public License v3.0

Performance improvements #29

Closed sposadac closed 7 years ago

sposadac commented 7 years ago

The main changes are listed below:

  1. The Intel POPCNT intrinsic for counting set bits has been added as an alternative way of computing Hamming distances between reads. It is used only when the intrinsic is available; otherwise the previous implementation is used.

  2. The limit on the number of reads that can be grouped together as unique reads has been raised from 100 to 100000. The performance gain comes at the expense of increased memory use. However, this makes it possible to run ShoRAH on large datasets (over a million reads, e.g. from HiSeq) on a reasonable time scale. On the other hand, when the read depth is of the order of ten thousand, maximum memory use is of the order of 27 GB.

  3. Cluster sizes, which take read weights into account (each unique read represents one or more identical reads), are now updated as reads are added or removed.

  4. Corrected reads are merged in parallel (multiprocessing module).

  5. Random seeds are set to ensure reproducibility.
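
To illustrate the idea behind item 1, here is a minimal Python sketch of a Hamming distance computed by XOR-ing bit-packed reads and counting set bits. It is not the code from this PR: the 2-bit base encoding and the names `pack`, `hamming_popcount` and `hamming_naive` are assumptions made for the example, and the PR itself relies on the hardware POPCNT instruction rather than a Python-level bit count.

```python
# Hypothetical 2-bit encoding of bases; not necessarily the encoding ShoRAH uses.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(read: str) -> int:
    """Pack a read into an integer, two bits per base."""
    value = 0
    for base in read:
        value = (value << 2) | CODE[base]
    return value


def hamming_popcount(a: str, b: str) -> int:
    """Hamming distance between two equal-length reads via XOR and bit counting."""
    if len(a) != len(b):
        raise ValueError("reads must have the same length")
    diff = pack(a) ^ pack(b)
    # A mismatch may set one or both bits of a 2-bit group, so fold each group
    # onto its low bit before counting. The mask is 0b0101...01, one pair per base.
    low_bits = (4 ** len(a) - 1) // 3
    mismatches = (diff | (diff >> 1)) & low_bits
    # bin(...).count("1") is a portable bit count; Python 3.10+ also offers int.bit_count().
    return bin(mismatches).count("1")


def hamming_naive(a: str, b: str) -> int:
    """Reference implementation that compares the reads base by base."""
    return sum(x != y for x, y in zip(a, b))
```

The bit-counting version replaces a per-base comparison loop with a few word-level operations, which is where the speed-up comes from when a hardware bit count is available. The naive version plays the role of the fallback mentioned above.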
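
The bookkeeping described in item 3 can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the `Cluster` class, its fields and its methods are hypothetical names. Each unique read carries a weight (the number of identical reads it represents), and the weighted cluster size is adjusted on every insertion or removal instead of being recomputed from scratch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Cluster:
    """Hypothetical cluster that tracks its weighted size incrementally."""

    weights: Dict[str, int] = field(default_factory=dict)  # unique read id -> weight
    size: int = 0  # total number of reads represented, counting weights

    def add(self, read_id: str, weight: int) -> None:
        """Add a unique read and increase the weighted size by its weight."""
        if read_id in self.weights:
            raise ValueError(f"read {read_id!r} is already in the cluster")
        self.weights[read_id] = weight
        self.size += weight

    def remove(self, read_id: str) -> None:
        """Remove a unique read and decrease the weighted size by its weight."""
        self.size -= self.weights.pop(read_id)
```

For example, adding a unique read of weight 3 and another of weight 1 gives a cluster size of 4, and removing the first leaves a size of 1.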