biointec / columba

Fast Approximate Pattern Matching using Search Schemes
GNU Affero General Public License v3.0
6 stars 2 forks source link

Use unordered_set to fix quadratic running time in FASTA parsing #9

Closed jnalanko closed 1 month ago

jnalanko commented 1 month ago

In FASTA parsing, sequence names were stored in a std::vector, and every new sequence name was searched versus every previous sequence name with std::find. This took O(n^2) time total, where n is the number of sequences, making the parsing extremely slow in my use case, where the index is a set of reads.

In this patch, I add an std::unordered_set of sequence names for O(1) sequence name lookup, reducing the running time to O(n).

lrenders commented 1 month ago

Thank you for the patch! I hadn't considered the use case where the index is a set of reads, so your patch is a welcome improvement.