cruizperez / MicrobeAnnotator

Pipeline for metabolic annotation of microbial genomes
Artistic License 2.0
133 stars 27 forks source link

Critical bugs in KOfam annotation step #94

Closed ivagljiva closed 5 months ago

ivagljiva commented 5 months ago

I’m Iva (@ivagljiva), writing on behalf of myself and my colleague Kat (@Kekananen). We were looking into this pipeline for unrelated reasons when we stumbled across a couple of bugs. Unfortunately, after investigating and testing further to make sure the problem was not on our end, we determined that these bugs critically affect the KOfam annotations that MicrobeAnnotator reports. However, we have a fix ready to go, which we'll be posting in a PR linked to this issue.

To summarize, there are two major bugs in the MicrobeAnnotator code, specifically the part that filters the HMMER results from the initial search against the KOfam database (in the hmmsearch.py module). These bugs are:

1) only the first KO model in a given subset is used to obtain the bit score threshold for filtering out weak HMM hits within the hmmer_filter() function (by 'subset', we mean one of the MicrobeAnnotator_DB/kofam_data/profiles/*.model files). This means that a majority of the KOfam hits are filtered using the wrong threshold, leading to a large number of false positive annotations. 2) when identifying the best match to a given gene in the best_match_selector() function, a string comparison (rather than a numeric comparison) of the bit scores is done, leading to incorrect assignment of the best match in some cases.

We identified these issues by looking at the kofam_results/*kofam and kofam_results/*filt output files reported by the pipeline on our test genomes. We'll show the output from a publicly-available genome, Bradyrhizobium manausense BR3351 (NCBI RefSeq GCF_001440035.1), to help explain what is going on. Note that this output is based on the KOfam database downloaded with microbeannotator_db_builder from December 15, 2023.

Bug 1: filtering hits with incorrect threshold

In going through the *.filt output file, we noticed that some hits to a given KO were below the bit score threshold for that KO. For instance, here are a couple of hits to K00119, which has a (full) threshold of 510.70 in our KOfam database:

$ head microbeannotator_output/Bradyrhizobium_manausense_BR3351/kofam_results/Bradyrhizobium_manausense_BR3351_proteins.fasta.kofam.filt
#                   ----    full_sequence   ----    ----    best_1_domain   ----    --- --- --- domain  number  estimation  --- --- description_of_target
# query_name    accession   target_name accession   score_type  E-value score   bias    E-value score   bias    exp reg clu ov  env dom rep inc description_of_target
# -----------   ---------   ----------  ---------   ----------  ------- -----   ----    ------- -----   ----    --- --- --- --  --- --- --- --- ---------------------
6333    -   K00003  -   domain  1.4e-152    509.1   17.8    1.5e-152    508.9   17.8    1.0 1   0   0   1   1   1   1   homoserine dehydrogenase [EC:1.1.1.3]
1672    -   K00119  -   full    5.2e-23 80.6    8.9 4.2e-21 74.4    8.9 2.5 1   1   0   1   1   1   1   alcohol dehydrogenase (nicotinoprotein) [EC:1.1.99.36]
4973    -   K00119  -   full    2.3e-28 98.3    6.4 8.9e-28 96.3    6.4 1.8 1   1   0   1   1   1   1   alcohol dehydrogenase (nicotinoprotein) [EC:1.1.99.36]

The two hits have full bit scores of 80.6 and 98.3, respectively, both of which are much less than 510.70.

We looked into the code and realized that the following line from the hmmer_filter() function is the issue:

model_name = hmmsearch_result[0].query.name

The model name is extracted only from the initial HMM hit in the hmmsearch_result variable that is passed to the function. However, this variable contains the hits to multiple KOs, specifically all KOs in a given MicrobeAnnotator_DB/kofam_data/profiles/*.model, since each CPU thread works on an individual *.model file at a time (based on this part of the codebase). And since that model name is subsequently used to identify the bit score threshold used to filter all hits within hmmsearch_result, that means a majority of hits are filtered using the wrong threshold.

We added a print statement to the hmmer_filter() function that prints the model information each time a different model and threshold are loaded, like this:

def hmmer_filter(
    hmmsearch_result: List[hmmer.tbl.TBLRow]) -> List[hmmer.tbl.TBLRow]:
    # Initialize list with filtered results
    hmmsearch_result_filt = []
    # Get model name, threshold and score_type
    model_name = hmmsearch_result[0].query.name
    threshold = hmm_model_info[model_name].threshold
    score_type = hmm_model_info[model_name].score_type
    print(f"Filtering hits for model {model_name} with {score_type} threshold {threshold}")  ## ADDED
    (....)

When we ran the modified code on our test genome and saved the output to a log file, we counted only 27 of these print statements in the log:

$ grep "Filtering hits for model" microbeannotator.log
Filtering hits for model K03088 with domain threshold 97.23
Filtering hits for model K17081 with full threshold 414.0
Filtering hits for model K16413 with - threshold -
Filtering hits for model K00001 with domain threshold 379.83
Filtering hits for model K01363 with domain threshold 228.63
Filtering hits for model K21811 with full threshold 449.23
Filtering hits for model K21536 with full threshold 148.83
Filtering hits for model K25071 with full threshold 765.63
Filtering hits for model K00790 with domain threshold 172.07
Filtering hits for model K07037 with full threshold 218.93
Filtering hits for model K08071 with full threshold 378.6
Filtering hits for model K03446 with domain threshold 350.63
Filtering hits for model K09726 with full threshold 179.77
Filtering hits for model K24847 with full threshold 674.23
Filtering hits for model K05540 with full threshold 282.07
Filtering hits for model K17241 with full threshold 288.03
Filtering hits for model K19168 with domain threshold 24.5
Filtering hits for model K11354 with full threshold 460.53
Filtering hits for model K18130 with full threshold 215.6
Filtering hits for model K10022 with domain threshold 343.67
Filtering hits for model K16649 with full threshold 152.1
Filtering hits for model K08979 with full threshold 166.07
Filtering hits for model K18304 with full threshold 218.1
Filtering hits for model K19429 with full threshold 223.6
Filtering hits for model K23661 with full threshold 205.5
Filtering hits for model K15950 with full threshold 1043.53
Filtering hits for model K11061 with domain threshold 563.5
$ grep -c "Filtering hits for model" microbeannotator.log
27

The KOs in these print statements correspond to the first KO in each *.model file:

$ for f in MicrobeAnnotator_DB/kofam_data/profiles/common*.model MicrobeAnnotator_DB/kofam_data/profiles/prokaryote*.model MicrobeAnnotator_DB/kofam_data/profiles/independent*.model; do head $f | grep NAME; done
NAME  K00001
NAME  K01363
NAME  K03088
NAME  K07994
NAME  K17081
NAME  K00790
NAME  K18130
NAME  K18304
NAME  K09725
NAME  K08979
NAME  K15950
NAME  K11061
NAME  K16649
NAME  K19429
NAME  K23661
NAME  K02595
NAME  K11354
NAME  K24847
NAME  K05540
NAME  K03446
NAME  K17241
NAME  K09726
NAME  K11008
NAME  K13477
NAME  K25279
NAME  K15947
NAME  K15995

Going back to our initial example of K00119, that KO is stored within prokaryote_2.model, the first KO in that set is K02595, and K02595 has a threshold of 50.27, which explains why those two hits were included in our output.

$ grep K00119 MicrobeAnnotator_DB/kofam_data/profiles/*.model
MicrobeAnnotator_DB/kofam_data/profiles/prokaryote_2.model:NAME  K00119
$ head MicrobeAnnotator_DB/kofam_data/profiles/prokaryote_2.model | grep NAME
NAME  K02595
$ grep K02595 MicrobeAnnotator_DB/kofam_data/ko_list
K02595  50.27   full    all 1.000000    424 370 296 128 3.81    0.590   nitrogenase-stabilizing/protective protein

It seems like this happened because hmmer_filter() was initially designed to handle one KO at a time, maybe before the pipeline was multi-threaded using these *.model files?

This bug can cause both false positives (keeping hits that are too weak given the true threshold as defined by KEGG) and false negatives (removing strong hits given the true threshold) in the annotation results. We counted the number of false positive annotations produced for our test genome with a simple Python script that simply checks each hit in *.filt output file against the true bitscore threshold, and we found 2,852 false positive annotations out of 4,366 total annotations, which is 65%. Note that our tests on other, smaller genomes yielded false positive percentages of 47% to 62%.

Here is the script for checking the number of false positives:

#!/usr/bin/env python
# A script to count the number of false positive annotations from MicrobeAnnotator output
# Usage: python count_false_positives.py PATH_TO_ANNOTATION_OUTPUT_DIRECTORY
# see below for expected directory structure of PATH_TO_ANNOTATION_OUTPUT_DIRECTORY

import os
import sys

# command-line argument
PATH_TO_ANNOTATION_OUTPUT=sys.argv[1]     # path to the microbeannotator output files (each test genome has its own directory inside this one)

## CHANGE THESE INPUTS FOR ALTERNATIVE DIRECTORY SETUPS
PATH_TO_MICROBEANNOTATOR_DB="../../MicrobeAnnotator_DB/" # path to the database set up by microbeannotator_db_builder
# this list holds the name of each genome annotated by microbeannotater, to be used in building the file names later
# as in "Bradyrhizobium_manausense_BR3351/kofam_results/Bradyrhizobium_manausense_BR3351_proteins.fasta.kofam.filt"
genomes_to_process = ["Bradyrhizobium_manausense_BR3351"]

ko_list = os.path.join(PATH_TO_MICROBEANNOTATOR_DB, "kofam_data/ko_list")
ko_dict = {}
ko_file = open(ko_list, 'r')
for line in ko_file.readlines():
   line = line.strip()
   entries = line.split()
   if entries[0] == 'knum':
     continue
   else:
     if entries[1] == '-':
       ko_dict[entries[0]] = {'threshold': None, 'score_type': entries[2]}
     else:
       ko_dict[entries[0]] = {'threshold': float(entries[1]), 'score_type': entries[2]}

for g in genomes_to_process:
  filt_output_file = os.path.join(PATH_TO_ANNOTATION_OUTPUT, f"{g}/kofam_results/{g}_proteins.fasta.kofam.filt")
  incorrect_annotations = []
  with open(filt_output_file, 'r') as filt:
    for line in filt.readlines():
      if line.startswith("#"):
        continue;
      else:
        hit_data = line.strip().split()
        ko = hit_data[2]
        full_bitscore = float(hit_data[6])
        domain_bitscore = float(hit_data[9])
        if ko_dict[ko]['threshold'] is None:
          continue;
        if ko_dict[ko]['score_type'] == 'full' and full_bitscore < ko_dict[ko]['threshold']:
          incorrect_annotations.append((ko, full_bitscore, ko_dict[ko]['threshold']))
        elif ko_dict[ko]['score_type'] == 'domain' and domain_bitscore < ko_dict[ko]['threshold']:
          incorrect_annotations.append((ko, domain_bitscore, ko_dict[ko]['threshold']))
  print(f"for {g}, the number of filtered annotations with bitscore less than the threshold is {len(incorrect_annotations)}")

Note that we didn't quantify the false negatives since that is a more involved process.

Bug 2: string comparison of bit scores for 'best match' identification

We compared the *.kofam output to the *.filt output and realized that the annotation within *.filt was not always the hit with highest bit score selected from the possibilities within *.kofam for a given gene. For instance, one of the weak annotations for K00119 in *.filt was supposedly the 'best match' for gene 1672. But here are all the hits for gene 1672 within the *.kofam file:

$ grep 1672 microbeannotator_output/Bradyrhizobium_manausense_BR3351/kofam_results/Bradyrhizobium_manausense_BR3351_proteins.fasta.kofam
1672    -   K00004  -   domain  1.7e-122    407.9   0.3 1.9e-122    407.8   0.3 1.0 1   0   0   1   1   1   1   (R,R)-butanediol dehydrogenase / meso-butanediol dehydrogenase / diacetyl reductase [EC:1.1.1.4 1.1.1.- 1.1.1.303]
1672    -   K05351  -   domain  2.2e-77 259.6   0.1 3.4e-77 259.0   0.1 1.2 1   0   0   1   1   1   1   D-xylulose reductase [EC:1.1.1.9]
1672    -   K23493  -   -   2.5e-08 32.2    1.0 7.8e-08 30.6    1.0 1.7 1   1   0   1   1   1   1   geissoschizine synthase [EC:1.3.1.36]
1672    -   K14577  -   -   1.3e-09 36.0    1.0 0.00038 18.1    0.1 2.1 2   0   0   2   2   2   2   aliphatic (R)-hydroxynitrile lyase [EC:4.1.2.46]
1672    -   K08322  -   domain  1.2e-85 286.5   0.7 1.4e-85 286.3   0.7 1.0 1   0   0   1   1   1   1   L-gulonate 5-dehydrogenase [EC:1.1.1.380]
1672    -   K20434  -   full    6.4e-15 53.7    5.4 1.6e-14 52.4    4.3 2.0 2   1   0   2   2   2   1   cyclitol reductase
1672    -   K00119  -   full    5.2e-23 80.6    8.9 4.2e-21 74.4    8.9 2.5 1   1   0   1   1   1   1   alcohol dehydrogenase (nicotinoprotein) [EC:1.1.99.36]
1672    -   K19961  -   full    6.8e-42 142.7   0.3 7.9e-42 142.5   0.3 1.0 1   0   0   1   1   1   1   6-hydroxyhexanoate dehydrogenase [EC:1.1.1.258]
1672    -   K05352  -   full    2.6e-16 58.3    0.0 3.4e-16 57.9    0.0 1.0 1   0   0   1   1   1   1   ribitol-5-phosphate 2-dehydrogenase (NADP+) [EC:1.1.1.405]
1672    -   K21190  -   full    2.1e-30 104.9   3.5 2.8e-30 104.5   3.5 1.1 1   0   0   1   1   1   1   dehydrogenase
1672    -   K18125  -   full    1.7e-21 75.7    0.7 8.7e-20 70.0    0.7 2.0 1   1   0   1   1   1   1   aldose 1-dehydrogenase [NAD(P)+] [EC:1.1.1.359]
1672    -   K13548  -   domain  9.5e-56 188.1   5.8 1.4e-55 187.5   5.8 1.2 1   0   0   1   1   1   1   2-deoxy-scyllo-inosamine dehydrogenase [EC:1.1.1.329]
1672    -   K14465  -   domain  3.6e-59 199.2   8.5 2.4e-39 134.0   1.4 2.2 1   1   1   2   2   2   2   succinate semialdehyde reductase (NADPH) [EC:1.1.1.-]

Clearly, the 'best match' should have been K00004, with the highest bit score of 407.9, and not K00119 with a bit score of 80.6.

We again added print statements and ran a portion of the best_match_selector() function in the Python terminal, with our test genome's *.kofam file as input, to see what was happening. Here is the code we ran:

# note that we set raw_results to be equal to a *.kofam output file first

with open(raw_results, 'r') as infile:
  best_matches = {};
  for line in infile:
    record = line.strip().split('\t');
    if record[0] not in best_matches:
      best_matches[record[0]] = [record[6], line];
    else:
      if record[6] > best_matches[record[0]][0]:
        print(f"old record has {best_matches[record[0]][0]}, new has {record[6]}")  ## ADDED
        print(f"type of old record: {type(best_matches[record[0]][0])}, type of new: {type(record[6])}")  ## ADDED
        best_matches[record[0]] = [record[6], line];

And here is one example from the printed output that showcases the problem:

old record has 117.8, new has 23.8
type of old record: <class 'str'>, type of new: <class 'str'>

This is a string comparison, not a numeric comparison, which means that the 'best match' is actually chosen based on alphabetical order of the bit scores rather than their relative value, and the two comparison types don't always yield the same result.

This bug causes weaker hits to sometimes be selected as the best match for a given gene, and it compounds the effect of the first bug. We wrote another script to count the number of times this happens in our test genomes. For B. manausense, we found 1,692 incorrect best matches out of 4,366 genes, for a percentage of ~39%. In our other tests, the percentage ranged from 21% to 35%.

Here is the script for checking the number of incorrect best matches:

#!/usr/bin/env python
# A script to count the number of false positive annotations from MicrobeAnnotator output
# Usage: python count_false_best_matches.py PATH_TO_ANNOTATION_OUTPUT_DIRECTORY
# see below for expected directory structure of PATH_TO_ANNOTATION_OUTPUT_DIRECTORY

import os
import sys

# command-line argument
PATH_TO_ANNOTATION_OUTPUT=sys.argv[1]     # path to the microbeannotator output files (each test genome has its own directory inside this one)

## CHANGE THESE INPUTS FOR ALTERNATIVE DIRECTORY SETUPS
PATH_TO_MICROBEANNOTATOR_DB="../../MicrobeAnnotator_DB/" # path to the database set up by microbeannotator_db_builder
# this list holds the name of each genome annotated by microbeannotater, to be used in building the file names later
# as in "Bradyrhizobium_manausense_BR3351/kofam_results/Bradyrhizobium_manausense_BR3351_proteins.fasta.kofam.filt"
genomes_to_process = ["Bradyrhizobium_manausense_BR3351",]

for g in genomes_to_process:
  kofam_output_file = os.path.join(PATH_TO_ANNOTATION_OUTPUT, f"{g}/kofam_results/{g}_proteins.fasta.kofam")
  filt_output_file = os.path.join(PATH_TO_ANNOTATION_OUTPUT, f"{g}/kofam_results/{g}_proteins.fasta.kofam.filt")
  # get the reported best matches
  filt_annotations = {}
  with open(filt_output_file, 'r') as filt:
    for line in filt.readlines():
      if line.startswith("#"):
        continue
      else:
        hit_data = line.strip().split()
        gene = hit_data[0]
        ko = hit_data[2]
        filt_annotations[gene] = ko
  # get the true best matches
  best_matches = {}
  with open(kofam_output_file, 'r') as kof:
    for line in kof.readlines():
      if line.startswith("#"):
        continue
      else:
        hit_data = line.strip().split()
        gene = hit_data[0]
        ko = hit_data[2]
        full_bitscore = float(hit_data[6])
        if gene not in best_matches:
          best_matches[gene] = (ko, full_bitscore)
        else:
          prev_bitscore = best_matches[gene][1]
          if full_bitscore > prev_bitscore:
            best_matches[gene] = (ko, full_bitscore)
  # compare true to reported
  genes_with_wrong_best_match = []
  for gene, best_match_tuple in best_matches.items():
    best_ko = best_match_tuple[0]
    if filt_annotations[gene] != best_ko:
      genes_with_wrong_best_match.append(gene)
  print(f"for {g}, the number of genes with the wrong best match annotation is {len(genes_with_wrong_best_match)}")

Conclusion

In summary, these bugs produce incorrect KOfam annotations (which may potentially be ameliorated by downstream annotations from Swissport, RefSeq, etc. in later steps of the pipeline, but can also still end up in the final annotation table). Luckily, these bugs ended up being straightforward to fix and accordingly, we are submitting a PR. We hope you will consider merging this PR to the main branch of MicrobeAnnotator.

bdaisley commented 5 months ago

@ivagljiva @Kekananen - I just came across your pull request while also digging through the code in an attempt to understand why the KOfam results from microbeannotator did not match results from other software. Major thanks for sharing these bug fixes, huge time saver!

I performed validation tests with the python scripts provided in your post, using the Parvimonas micra ATCC 33270 type strain genome (GCF_000154405.1) as an example.

count_false_positives.py results relevant to fix for Bug 1

Before fix (Using "cruizperez/MicrobeAnnotator" branch):

for GCF_000154405.1, the number of filtered annotations with bitscore less than the threshold is 318

After fix (Using "ivagljiva/MicrobeAnnotator/tree/bug_fixes" branch):

for GCF_000154405.1, the number of filtered annotations with bitscore less than the threshold is 0
False positive annotations Total annotations % false positives
Before fix 318 707 44.9%
After fix 0 943 0.0 %

count_false_best_matches.py results relevant to fix for Bug 2

Before fix (Using "cruizperez/MicrobeAnnotator" branch):
for GCF_000154405.1, the number of genes with the wrong best match annotation is 159
After fix (Using "ivagljiva/MicrobeAnnotator/tree/bug_fixes" branch):
for GCF_000154405.1, the number of genes with the wrong best match annotation is 0
Incorrect best matches Total annotations % incorrect best matches
Before fix 159 707 22.5%
After fix 0 943 0.0 %

Looks like everything in the pull request works as planned, relevant to commits 6602246 2d25a62 89f4c88

meren commented 5 months ago

Wow. Thank you very much for your follow-up investigation and sharing your findings and confirmation, @bdaisley! You rock.

cruizperez commented 5 months ago

Hi Iva! Thank you so much for finding this huge bug and submitting a PR to fix it. I am reviewing the commits, and they look good, thank you so much; I only had a small comment on https://github.com/cruizperez/MicrobeAnnotator/commit/89f4c88441679dbde8427da9aab0001c6f1a6520. I also apologize for the delay on my replies, I've been away from the repo for too long. Maybe this is a good starting point for a refactoring :).

ivagljiva commented 5 months ago

No problem at all! Happy to help ☺️ Thanks for merging!