cutoffs for warning with f_ident

taylorreiter commented 4 years ago

Currently, if f_ident is < 10%, the user receives a warning: too few identifiable hashes; f_ident < 10%. provide a lineage for this genome. Given that this parameter was selected randomly, we may need to adjust it. Digging into charcoal results for some microbiomes derived from human gut microbiomes, we observe the following three examples. From these three examples, I recommend that we increase the f_major required for a charcoal to run when f_ident is low. I propose a pipeline below to investigate the "best" parameters to use as defaults. For now, maybe something along the lines of f_major * f_ident => .05 for charcoal to proceed?

SRR4033063_bin.66

brieftax: genus g__Clostridium f_major: 0.25824176 f_ident: 0.10016511

Digging into *report.txt, we see that genbank accession GCA_000435835.1 is the best match. If we run nucmer + circos for visualization, we see that the bin and this genome from genus Clostridium have very few nucleotides in common, likely indicating that clostridium is actually a contaminant in this bin. The small blue track with the red arrow is the bin in this case, while the outer edge is the reference genome. SRR4033063_bin 66_nucmer2circos

GTDB-tk assigns this bin to d__Bacteria;p__Firmicutes;c__Bacilli;o__RFN20;f__CAG-449;g__CAG-449.

SRR2857885_bin.40

brieftax: genus g__Intestinibacter f_major: 0.87025316 f_ident: 0.10501828

Digging into *report.txt, we see that genbank accession GCA_000435835.1 is the best match. If we run nucmer + circos for visualization, we see that the bin and this genome from genus intestinibacter have many nucleotides stretches with high percent identity, likely indicating that Intestinibacter is the correct identity of this bin. The small blue track is the bin in this case, while the outer edge is the reference genome. SRR2857885_bin 40_nucmer2circos

GTDB-tk assigns this bin to d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Peptostreptococcales;f__Peptostreptococcaceae;g__Intestinibacter, or the same lineage as charcoal.

SRR1793410_bin.1

brieftax: genus g__Klebsiella_A f_major: 0.3556962 f_ident: 0.15665279

SRR1793410_bin.1 is between the two above cases, with a slighly higher f_ident but fairly low f_major. Digging into *report.txt, we see that genbank accession GCA_002925905.1 is the best match. If we run nucmer + circos for visualization, we see that the bin and this genome from genus g__Klebsiella_A have many nucleotides stretches with high percent identity, likely indicating that Ig__Klebsiella_A is the correct identity of this bin. The small blue track is the bin in this case, while the outer edge is the reference genome. SRR1793410_bin 1_nucmer2circos

GTDB-tk assigns this bin to d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Klebsiella_A, or the same lineage as charcoal.

Proposed pipeline to continue investigating these parameters

To generate these figures, I grabbed the best clean genome match from gather at the bottom of the *results.txt file, ran nucmer + circos, and then ran GTDBtk on the bin to validate lineage. We could determine what threshold f_major * f_ident by using nucmer results and calculating # bp > XX% percent identity between bin and reference / total bp in reference`where we would expect ~>70% similarity to be considered the same genus?

The benefit of nucmer is it is fast, while gtdbtk is quite slow to run at scale. We could adopt the cutoffs used by gtdbtk though for amino acid or nucleotide percent identity required for genus level classification.

taylorreiter commented 4 years ago

mummer + circos is cool, but it might be too heavy weight for this. Here are fastANI results

SRR4033063_bin.66.fna    GCA_000435835.1_MGS221_genomic.fna    NA
SRR2857885_bin.40.fa    GCA_000154445.1_ASM15444v1_genomic.fna  87.6559 744     991
SRR1793410_bin.1.fna    GCA_002925905.1_ASM292590v1_genomic.fna 86.4279 1242    1696

it might be sufficient to compare fastani with f_major * f_ident to develop a cutoff, but this also uses k-mers/sketching, so not sure this is the best idea?

taylorreiter commented 4 years ago

Amino acid identity could be a good things to use, but predicting AA seqs + calculating AAI might be computationally expensive

ctb commented 4 years ago

for evaluation/setting thresholds, I think heavyweight methods are fine, but I'm not sure we should trust them to tell us about contamination?

it does seem circular to use ANI, or at least we'd have to be careful.

what I did in sourmash-oddify was use nucmer to align putative contaminant to candidate reference genome. that seemed reliable-ish.

taylorreiter commented 4 years ago

ya i think nucmer/promer might win here, with a simple script to calculate average percent identity given the total base pairs in the reference. Nucmer is quite fast for two bacterial genomes.

dib-lab / charcoal