Open ctb opened 3 years ago
OK - did this for GCA_001509055.1 in hu-s1 (SRR1976948) and it was illuminating (hah!) but not really surprising -
sez they're all basically the same, as we would expect --
% fastANI --ql list.txt --rl list.txt -o xxx
% cat xxx
consensus.fa consensus.fa 100 128 129
consensus.fa genbank.fa 99.8343 128 129
consensus.fa megahit.fa 99.6736 113 129
genbank.fa genbank.fa 100 128 129
genbank.fa consensus.fa 99.8344 128 129
genbank.fa megahit.fa 99.7843 113 129
megahit.fa megahit.fa 99.9999 125 125
megahit.fa genbank.fa 99.6355 102 125
megahit.fa consensus.fa 99.5259 102 125
sez the genbank and consensus are quite similar, but the megahit one is quite different
% sourmash compute -k 31 --scaled=1000 *.fa
% sourmash compare *.sig
0-consensus.fa [1. 0.921 0.666]
1-genbank.fa [0.921 1. 0.689]
2-megahit.fa [0.666 0.689 1. ]
this highlights two things - ANI != Jaccard similarity, and bcftools consensus
and sgc both seem to ~work :)
That's interesting. What does the megahit assembly look like? From the limited information in the two tables I'd assume that there were regions of low similarity/absence in both assemblies (but especially megahit's) that fastANI was ignoring - i.e. coverage is effectively lower. It may be that, in low coverage comparisons, Jaccard approaches give a more "honest" estimate of genome similarity (taking into account overall composition), but may underestimate the relatedness of regions that are homologous.
I agree! (I'm not sure how to answer the question "what does the megahit assembly look like" tho :)
This is just more of the "ANI is about shared regions, Jaccard measures highlights differences outside of shared regions too" concept.
I'm going to do this on a bunch more genomes and look into the results more systematically once I have bigger n; I have all the raw results, just need to do the preprocessing!
Here's an interesting case --
FastANI output:
genbank-GCA_001508995.1.fna.gz genbank-GCA_001508995.1.fna.gz 100 685 694
genbank-GCA_001508995.1.fna.gz megahit-GCA_001508995.1.megahit.fa.gz 99.945 677 694
genbank-GCA_001508995.1.fna.gz cons-GCA_001508995.1.consensus.fa.gz 96.6229 645 694
sourmash compare output:
0-cons-GCA_001508... [1. 0.392 0.416]
1-genbank-GCA_001... [0.392 1. 0.434]
2-megahit-GCA_001... [0.416 0.434 1. ]
- both the mapping-based consensus genome and the sgc genome are more similar to each other (by a lot) than they are to the genbank genome!
Note also that we searched ALL of genbank, so there is no better genome out there than that one - and we built two that much more closely match to the SRR1976948 metagenome!
I have dozens of these now. I wonder what the best way to summarize is 🤔 .
I mean, it's kinda clear that ANI and Jaccard similarities are poor proxies for each other...
I guess a different thing we really want is a summary table for metagenome-derived genome ANIs vs genbank. So, for each metagenome, we'd produce a table:
and maybe that would be a valuable report for people considering whether or not to construct a new metagenome-derived genome?
Hmm, one more thing is, how much of the genbank/consensus genome is covered by the reads here? (100% of the megahit genome is covered by the reads)
ok, some specific questions to ask -
(and then we need to get into "what is ANI useful for?")
huh. not what I was expecting! misc notes --
dunno if this is useful. things to ruminate on.
May not be relevant and I may be misinterpreting but -- some of Hu SB1 was in Hu SB2. Best match in genbank could originate from SB2 and not SB1 which could cause weird results? Seems weird because gather should already pull the best results...just thinking out loud, and as I do so this comment is making less sense yet i'm going to post it here anyway :)
I was thinking about the same thing and came to the same realization (that it shouldn't matter). However, it is definitely important to realize that some of the reference genomes were built from hu-s1, and some weren't! That some were taken directly from hu-s1 is why we have a lot of ANI~100% in there, I think!
I agree! (I'm not sure how to answer the question "what does the megahit assembly look like" tho :)
Fair point… ;)
I think I mean in terms of genome completeness, CDS accuracy/completeness, that sort of thing. I'm not very familiar with output from MEGAHIT.
[edit] sorry - I'm being a bit slow… I just read properly and realised you're using a community read dataset as input; I had misunderstood. Apologies.
Hmm, one more thing is, how much of the genbank/consensus genome is covered by the reads here? (100% of the megahit genome is covered by the reads)
I agree (now I've better internalised what's going on). I think we may be able to make a case for a threshold coverage (or something that expresses relative confidence) where we have previously observed high ANI %ID, but very low %coverage between distantly-related complete genomes.
I agree! (I'm not sure how to answer the question "what does the megahit assembly look like" tho :)
Fair point… ;)
I think I mean in terms of genome completeness, CDS accuracy/completeness, that sort of thing. I'm not very familiar with output from MEGAHIT.
there are three ways to do this.
I’m in the process of trying (2) and (3) now. All very straightforward with tools we have.
Not clear what to do about genomes that are not fully covered, will do some exploratory analysis! probably best to eliminate those regions from consideration (which is what an assembly based approach will do).
comments from @taylorreiter -
comments from @widdowquinn -
and a somewhat tangential comment from nanopore considerations -