applied-bioinformatics / An-Introduction-To-Applied-Bioinformatics

Interactive lessons in bioinformatics.
http://readIAB.org
Other
805 stars 316 forks source link

local_alignment_search todo comment #255

Open corburn opened 7 years ago

corburn commented 7 years ago

The following is a snippet from 2.2.4 A complete homology search function:

Spend a minute looking at this function to understand what it's doing.

# then we reverse-sort them by score, and return the n highest
# scoring alignments (this needs to be updated so we only
# ever keep track of the n highest scoring alignments)
best_hits = sorted(hits, key=lambda e: e[1], reverse=True)[:n]

The comment says the code 'needs to be updated to keep track of the highest scoring alignments', but it appears this is already accomplished by slicing the list with[:n]. Is this an old comment that can be removed or is the goal to replace the hits list with something like a heap such that only the n highest scoring elements are kept in memory at any one time?

Assuming option 2, the following should restrict memory usage to the top n results:

from heapq import nlargest

def yield_hits(query, reference_db):
    for reference in reference_db:
        aln, score, _ = aligner(query, reference)
        yield [reference.metadata['id'], score, aln, reference.metadata['taxonomy']]

best_hits = nlargest(n, yield_hits(query, reference_db), key=lambda e: e[1])

If option 2 looks good I will submit a pull request.

https://github.com/caporaso-lab/An-Introduction-To-Applied-Bioinformatics/blob/e1e4beb1750b5be179470ee37fd42c55bce9f889/iab/algorithms/__init__.py#L633-L636