@ivagljiva @Kekananen - I just came across your pull request while also digging through the code in an attempt to understand why the KOfam results from microbeannotator did not match results from other software. Major thanks for sharing these bug fixes, huge time saver!
I ran validation tests with the Python scripts provided in your post, using the Parvimonas micra ATCC 33270 type strain genome (GCF_000154405.1) as an example.
count_false_positives.py
Results relevant to the fix for Bug 1:
for GCF_000154405.1, the number of filtered annotations with bitscore less than the threshold is 318
for GCF_000154405.1, the number of filtered annotations with bitscore less than the threshold is 0
| | False positive annotations | Total annotations | % false positives |
|---|---|---|---|
| Before fix | 318 | 707 | 44.9% |
| After fix | 0 | 943 | 0.0% |
count_false_best_matches.py
Results relevant to the fix for Bug 2:
for GCF_000154405.1, the number of genes with the wrong best match annotation is 159
for GCF_000154405.1, the number of genes with the wrong best match annotation is 0
| | Incorrect best matches | Total annotations | % incorrect best matches |
|---|---|---|---|
| Before fix | 159 | 707 | 22.5% |
| After fix | 0 | 943 | 0.0% |
It looks like everything in the pull request works as planned (relevant commits: 6602246, 2d25a62, 89f4c88).
Wow. Thank you very much for your follow-up investigation and sharing your findings and confirmation, @bdaisley! You rock.
Hi Iva! Thank you so much for finding this huge bug and submitting a PR to fix it. I am reviewing the commits, and they look good; I only had a small comment on https://github.com/cruizperez/MicrobeAnnotator/commit/89f4c88441679dbde8427da9aab0001c6f1a6520. I also apologize for the delay in my replies, I've been away from the repo for too long. Maybe this is a good starting point for a refactoring :).
No problem at all! Happy to help ☺️ Thanks for merging!
I’m Iva (@ivagljiva), writing on behalf of myself and my colleague Kat (@Kekananen). We were looking into this pipeline for unrelated reasons when we stumbled across a couple of bugs. Unfortunately, after investigating and testing further to make sure the problem was not on our end, we determined that these bugs critically affect the KOfam annotations that MicrobeAnnotator reports. However, we have a fix ready to go, which we'll be posting in a PR linked to this issue.
To summarize, there are two major bugs in the MicrobeAnnotator code, specifically in the part that filters the HMMER results from the initial search against the KOfam database (in the `hmmsearch.py` module). These bugs are:
1) Only the first KO model in a given subset is used to obtain the bit score threshold for filtering out weak HMM hits within the `hmmer_filter()` function (by 'subset', we mean one of the `MicrobeAnnotator_DB/kofam_data/profiles/*.model` files). This means that a majority of the KOfam hits are filtered using the wrong threshold, leading to a large number of false positive annotations.
2) When identifying the best match to a given gene in the `best_match_selector()` function, a string comparison (rather than a numeric comparison) of the bit scores is done, leading to incorrect assignment of the best match in some cases.
We identified these issues by looking at the `kofam_results/*kofam` and `kofam_results/*filt` output files reported by the pipeline on our test genomes. We'll show the output from a publicly available genome, Bradyrhizobium manausense BR3351 (NCBI RefSeq GCF_001440035.1), to help explain what is going on. Note that this output is based on the KOfam database downloaded with `microbeannotator_db_builder` from December 15, 2023.
Bug 1: filtering hits with incorrect threshold
In going through the `*.filt` output file, we noticed that some hits to a given KO were below the bit score threshold for that KO. For instance, here are a couple of hits to K00119, which has a (full) threshold of 510.70 in our KOfam database:
The two hits have full bit scores of 80.6 and 98.3, respectively, both of which are much less than 510.70.
We looked into the code and realized that the following line from the `hmmer_filter()` function is the issue:
The model name is extracted only from the initial HMM hit in the `hmmsearch_result` variable that is passed to the function. However, this variable contains the hits to multiple KOs, specifically all KOs in a given `MicrobeAnnotator_DB/kofam_data/profiles/*.model` file, since each CPU thread works on an individual `*.model` file at a time (based on this part of the codebase). And since that model name is subsequently used to identify the bit score threshold used to filter all hits within `hmmsearch_result`, that means a majority of hits are filtered using the wrong threshold.
We added a print statement to the `hmmer_filter()` function that prints the model information each time a different model and threshold are loaded, like this:
When we ran the modified code on our test genome and saved the output to a log file, we counted only 27 of these print statements in the log:
The KOs in these print statements correspond to the first KO in each `*.model` file:
Going back to our initial example of K00119, that KO is stored within `prokaryote_2.model`, the first KO in that set is K02595, and K02595 has a threshold of 50.27, which explains why those two hits were included in our output.
It seems like this happened because `hmmer_filter()` was initially designed to handle one KO at a time, maybe before the pipeline was multi-threaded using these `*.model` files?
This bug can cause both false positives (keeping hits that are too weak given the true threshold as defined by KEGG) and false negatives (removing strong hits given the true threshold) in the annotation results. We counted the number of false positive annotations produced for our test genome with a simple Python script that checks each hit in the `*.filt` output file against the true bit score threshold, and we found 2,852 false positive annotations out of 4,366 total annotations, which is 65%. Note that our tests on other, smaller genomes yielded false positive percentages of 47% to 62%.
Here is the script for checking the number of false positives:
Note that we didn't quantify the false negatives since that is a more involved process.
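The effect of Bug 1 and the false positive check can be sketched roughly as follows. This is not the script used for the numbers reported here, only a minimal illustration: the function names, the in-memory data layout, the gene labels, and the K02595 hit score are assumptions made for the example, while the two thresholds and the two K00119 scores come from the text above.

```python
# Hypothetical sketch: names and data layout are illustrative assumptions,
# not MicrobeAnnotator's actual code or file formats.

# KO -> bit score threshold (values quoted in the text above).
THRESHOLDS = {"K02595": 50.27, "K00119": 510.70}

# (gene, KO, full bit score) hits from one *.model subset.
# The K02595 score is made up; the K00119 scores are from the text.
HITS = [
    ("gene_a", "K02595", 120.0),
    ("gene_b", "K00119", 80.6),
    ("gene_c", "K00119", 98.3),
]


def filter_with_first_model_threshold(hits, thresholds):
    """Mimic the bug: one threshold, taken from the first hit's KO, for all hits."""
    if not hits:
        return []
    threshold = thresholds[hits[0][1]]
    return [hit for hit in hits if hit[2] >= threshold]


def filter_with_per_hit_threshold(hits, thresholds):
    """Intended behavior: compare each hit against its own KO's threshold."""
    return [hit for hit in hits if hit[2] >= thresholds[hit[1]]]


def count_false_positives(kept_hits, thresholds):
    """Count kept hits whose score is below the threshold of their own KO."""
    return sum(1 for _, ko, score in kept_hits if score < thresholds[ko])


buggy = filter_with_first_model_threshold(HITS, THRESHOLDS)
fixed = filter_with_per_hit_threshold(HITS, THRESHOLDS)
print(count_false_positives(buggy, THRESHOLDS))  # 2
print(count_false_positives(fixed, THRESHOLDS))  # 0
```

With the shared threshold of 50.27, both weak K00119 hits survive the filter; with per-hit thresholds, only the K02595 hit is kept.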
Bug 2: string comparison of bit scores for 'best match' identification
We compared the `*.kofam` output to the `*.filt` output and realized that the annotation within `*.filt` was not always the hit with the highest bit score selected from the possibilities within `*.kofam` for a given gene. For instance, one of the weak annotations for K00119 in `*.filt` was supposedly the 'best match' for gene 1672. But here are all the hits for gene 1672 within the `*.kofam` file:
Clearly, the 'best match' should have been K00004, with the highest bit score of 407.9, and not K00119 with a bit score of 80.6.
We again added print statements and ran a portion of the `best_match_selector()` function in the Python terminal, with our test genome's `*.kofam` file as input, to see what was happening. Here is the code we ran:
And here is one example from the printed output that showcases the problem:
This is a string comparison, not a numeric comparison, which means that the 'best match' is actually chosen based on the character-by-character (lexicographic) order of the bit scores rather than their numeric value, and the two comparison types don't always yield the same result.
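The difference is easy to reproduce with the two scores from the gene 1672 example. This is a minimal sketch: the variable names are illustrative, and the list holds only the two hits quoted above, not the full set of hits for that gene.

```python
# Two of the hits for gene 1672 quoted above, with scores kept as strings,
# as they typically are when read from a text file without conversion.
hits = [("K00004", "407.9"), ("K00119", "80.6")]

# String comparison goes character by character: "8" sorts after "4",
# so "80.6" is considered greater than "407.9".
print("80.6" > "407.9")                # True
print(float("80.6") > float("407.9"))  # False

best_by_string = max(hits, key=lambda hit: hit[1])
best_by_number = max(hits, key=lambda hit: float(hit[1]))
print(best_by_string[0])  # K00119
print(best_by_number[0])  # K00004
```

Converting the scores with `float()` before comparing is enough to recover the expected best match in this example.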
This bug causes weaker hits to sometimes be selected as the best match for a given gene, and it compounds the effect of the first bug. We wrote another script to count the number of times this happens in our test genomes. For B. manausense, we found 1,692 incorrect best matches out of 4,366 genes, for a percentage of ~39%. In our other tests, the percentage ranged from 21% to 35%.
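A count like this could be obtained along the following lines. This is a hedged sketch rather than the script used for the numbers above: the function name and the dictionary layout are assumptions, and `gene_x` is an invented entry alongside the gene 1672 hits from the text.

```python
def count_wrong_best_matches(hits_by_gene, reported_best):
    """Count genes whose reported best match is not a numerically top-scoring hit.

    hits_by_gene: gene -> list of (KO, bit score) tuples (assumed layout).
    reported_best: gene -> KO reported as the best match.
    """
    wrong = 0
    for gene, reported_ko in reported_best.items():
        hits = hits_by_gene[gene]
        top_score = max(float(score) for _, score in hits)
        reported_scores = [float(score) for ko, score in hits if ko == reported_ko]
        # Wrong if the reported KO is absent or scores below the true maximum.
        if not reported_scores or max(reported_scores) < top_score:
            wrong += 1
    return wrong


hits_by_gene = {
    "1672": [("K00004", 407.9), ("K00119", 80.6)],  # scores from the text
    "gene_x": [("K02595", 120.0)],                  # invented example
}
reported_best = {"1672": "K00119", "gene_x": "K02595"}

wrong = count_wrong_best_matches(hits_by_gene, reported_best)
print(f"{wrong} of {len(reported_best)} genes have the wrong best match")
```

Comparing scores rather than KO identifiers means a tie between two KOs with the same top score is not counted as an error.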
Here is the script for checking the number of incorrect best matches:
Conclusion
In summary, these bugs produce incorrect KOfam annotations (which may be ameliorated by downstream annotations from Swiss-Prot, RefSeq, etc. in later steps of the pipeline, but can also still end up in the final annotation table). Luckily, these bugs ended up being straightforward to fix, and accordingly we are submitting a PR. We hope you will consider merging this PR to the main branch of MicrobeAnnotator.