Open taylorreiter opened 4 years ago
mummer + circos is cool, but it might be too heavy weight for this. Here are fastANI results
SRR4033063_bin.66.fna GCA_000435835.1_MGS221_genomic.fna NA
SRR2857885_bin.40.fa GCA_000154445.1_ASM15444v1_genomic.fna 87.6559 744 991
SRR1793410_bin.1.fna GCA_002925905.1_ASM292590v1_genomic.fna 86.4279 1242 1696
it might be sufficient to compare fastani with f_major
* f_ident
to develop a cutoff, but this also uses k-mers/sketching, so not sure this is the best idea?
Amino acid identity could be a good things to use, but predicting AA seqs + calculating AAI might be computationally expensive
for evaluation/setting thresholds, I think heavyweight methods are fine, but I'm not sure we should trust them to tell us about contamination?
it does seem circular to use ANI, or at least we'd have to be careful.
what I did in sourmash-oddify was use nucmer to align putative contaminant to candidate reference genome. that seemed reliable-ish.
ya i think nucmer/promer might win here, with a simple script to calculate average percent identity given the total base pairs in the reference. Nucmer is quite fast for two bacterial genomes.
Currently, if
f_ident
is < 10%, the user receives a warning:too few identifiable hashes; f_ident < 10%. provide a lineage for this genome.
Given that this parameter was selected randomly, we may need to adjust it. Digging into charcoal results for some microbiomes derived from human gut microbiomes, we observe the following three examples. From these three examples, I recommend that we increase thef_major
required for a charcoal to run whenf_ident
is low. I propose a pipeline below to investigate the "best" parameters to use as defaults. For now, maybe something along the lines off_major
*f_ident
=> .05 for charcoal to proceed?SRR4033063_bin.66
brieftax
: genus g__Clostridiumf_major
: 0.25824176f_ident
: 0.10016511Digging into
*report.txt
, we see that genbank accession GCA_000435835.1 is the best match. If we run nucmer + circos for visualization, we see that the bin and this genome from genus Clostridium have very few nucleotides in common, likely indicating that clostridium is actually a contaminant in this bin. The small blue track with the red arrow is the bin in this case, while the outer edge is the reference genome.GTDB-tk assigns this bin to
d__Bacteria;p__Firmicutes;c__Bacilli;o__RFN20;f__CAG-449;g__CAG-449
.SRR2857885_bin.40
brieftax
: genus g__Intestinibacterf_major
: 0.87025316f_ident
: 0.10501828Digging into
*report.txt
, we see that genbank accession GCA_000435835.1 is the best match. If we run nucmer + circos for visualization, we see that the bin and this genome from genus intestinibacter have many nucleotides stretches with high percent identity, likely indicating that Intestinibacter is the correct identity of this bin. The small blue track is the bin in this case, while the outer edge is the reference genome.GTDB-tk assigns this bin to
d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Peptostreptococcales;f__Peptostreptococcaceae;g__Intestinibacter
, or the same lineage as charcoal.SRR1793410_bin.1
brieftax
: genus g__Klebsiella_Af_major
: 0.3556962f_ident
: 0.15665279SRR1793410_bin.1 is between the two above cases, with a slighly higher
f_ident
but fairly lowf_major
. Digging into*report.txt
, we see that genbank accession GCA_002925905.1 is the best match. If we run nucmer + circos for visualization, we see that the bin and this genome from genus g__Klebsiella_A have many nucleotides stretches with high percent identity, likely indicating that Ig__Klebsiella_A is the correct identity of this bin. The small blue track is the bin in this case, while the outer edge is the reference genome.GTDB-tk assigns this bin to
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Klebsiella_A
, or the same lineage as charcoal.Proposed pipeline to continue investigating these parameters
To generate these figures, I grabbed the best clean genome match from gather at the bottom of the
*results.txt
file, ran nucmer + circos, and then ran GTDBtk on the bin to validate lineage. We could determine what thresholdf_major
*f_ident
by using nucmer results and calculating # bp > XX% percent identity between bin and reference / total bp in reference`where we would expect ~>70% similarity to be considered the same genus?The benefit of nucmer is it is fast, while gtdbtk is quite slow to run at scale. We could adopt the cutoffs used by gtdbtk though for amino acid or nucleotide percent identity required for genus level classification.