Spend a minute looking at this function to understand what it's doing.
```python
# then we reverse-sort them by score, and return the n highest
# scoring alignments (this needs to be updated so we only
# ever keep track of the n highest scoring alignments)
best_hits = sorted(hits, key=lambda e: e[1], reverse=True)[:n]
```
The comment says the code 'needs to be updated so we only ever keep track of the n highest scoring alignments', but it appears this is already accomplished by slicing the sorted list with `[:n]`. Is this an old comment that can be removed, or is the goal to replace the `hits` list with something like a heap, so that only the `n` highest-scoring elements are ever kept in memory at any one time?
- option 1: remove the comment
- option 2: replace the list with a heap
- option n: ???
Assuming option 2, the following should restrict memory usage to the top n results:
```python
from heapq import nlargest

def yield_hits(query, reference_db):
    # lazily yield one hit record per reference, instead of building a list
    for reference in reference_db:
        aln, score, _ = aligner(query, reference)
        yield [reference.metadata['id'], score, aln,
               reference.metadata['taxonomy']]

# nlargest keeps at most n entries in memory while consuming the generator
best_hits = nlargest(n, yield_hits(query, reference_db), key=lambda e: e[1])
```
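As a quick sanity check, the two expressions agree on toy data (stubbed hit records with random scores standing in for real alignments; none of the names below come from the IAB code):

```python
import random
from heapq import nlargest

random.seed(0)
# fake hit records shaped like [id, score, alignment, taxonomy]
hits = [['seq%d' % i, random.random(), None, 'taxon%d' % i]
        for i in range(1000)]
n = 5

by_sort = sorted(hits, key=lambda e: e[1], reverse=True)[:n]
by_heap = nlargest(n, (hit for hit in hits), key=lambda e: e[1])

assert by_sort == by_heap  # same records, same (descending) order
```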
If option 2 looks good I will submit a pull request.
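If an explicit heap is preferred over `nlargest` (i.e. option 2 taken literally), a rough sketch of the same idea, again assuming the `[id, score, alignment, taxonomy]` record layout used above (`top_n_hits` is just an illustrative name, not something in the codebase):

```python
import heapq
import itertools

def top_n_hits(hit_iter, n):
    heap = []                    # min-heap of (score, tiebreaker, record)
    counter = itertools.count()  # tiebreaker so records are never compared
    for record in hit_iter:
        entry = (record[1], next(counter), record)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif heap and entry[0] > heap[0][0]:
            # new hit beats the worst of the current best n; swap it in
            heapq.heappushpop(heap, entry)
    # highest scores first, like the existing sorted(...)[:n]
    return [record for _, _, record in sorted(heap, reverse=True)]
```

Either way, the memory footprint is O(n) hit records rather than O(number of reference sequences).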
For reference, the snippet quoted at the top is from *2.2.4 A complete homology search function*:
https://github.com/caporaso-lab/An-Introduction-To-Applied-Bioinformatics/blob/e1e4beb1750b5be179470ee37fd42c55bce9f889/iab/algorithms/__init__.py#L633-L636