computing ANI for metagenomes vs ref genomes

ctb commented 3 years ago

there are three ways to do this.

mapping based: map reads, call variants, build a variant call file (VCF), estimate ANI from VCF SNP calls.
mapping based: create a new consensus genome based on VCF (bcftools supports this natively!)
assembly based: do spacegraphcats queries (which will get all mapped reads + graph neighborhood), run assembler.

I’m in the process of trying (2) and (3) now. All very straightforward with tools we have.

Not clear what to do about genomes that are not fully covered, will do some exploratory analysis! probably best to eliminate those regions from consideration (which is what an assembly based approach will do).

comments from @taylorreiter -

ANI is a more accepted/understood metric for genome relatedness than the %s gather reports. Would tell you how related thing in Q is to all the Rs in a database. Theoretically could insert into a tree based on that info, i think a la pplacer?

comments from @widdowquinn -

I suspect, but have not shown, that there is an asymptotic relationship between proportion of genome covered and ANI %identity. I wouldn’t be surprised if the proportion of a bacterial genome you need for an acceptable estimate of ANI %id is relatively small (cgMLST/CNI may well even be overkill…). However, %coverage does matter for interpretation, and the extent of mapping is an upper bound on the observed %coverage.

and a somewhat tangential comment from nanopore considerations -

I would expect k-mer approaches to always work better here; I think it might be satisfying to demonstrate that ;)

ctb commented 3 years ago

OK - did this for GCA_001509055.1 in hu-s1 (SRR1976948) and it was illuminating (hah!) but not really surprising -

genbank.fa - the genbank version of the genome
consensus.fa - the output of bcftools consensus
megahit.fa - sgc neighborhood -> megahit.

FastANI

sez they're all basically the same, as we would expect --

% fastANI --ql list.txt --rl list.txt  -o  xxx
% cat xxx
consensus.fa    consensus.fa    100     128     129
consensus.fa    genbank.fa      99.8343 128     129
consensus.fa    megahit.fa      99.6736 113     129
genbank.fa      genbank.fa      100     128     129
genbank.fa      consensus.fa    99.8344 128     129
genbank.fa      megahit.fa      99.7843 113     129
megahit.fa      megahit.fa      99.9999 125     125
megahit.fa      genbank.fa      99.6355 102     125
megahit.fa      consensus.fa    99.5259 102     125

sourmash compare

sez the genbank and consensus are quite similar, but the megahit one is quite different

% sourmash compute -k 31 --scaled=1000 *.fa
% sourmash compare *.sig
0-consensus.fa          [1.    0.921 0.666]
1-genbank.fa            [0.921 1.    0.689]
2-megahit.fa            [0.666 0.689 1.   ]

this highlights two things - ANI != Jaccard similarity, and bcftools consensus and sgc both seem to ~work :)

widdowquinn commented 3 years ago

That's interesting. What does the megahit assembly look like? From the limited information in the two tables I'd assume that there were regions of low similarity/absence in both assemblies (but especially megahit's) that fastANI was ignoring - i.e. coverage is effectively lower. It may be that, in low coverage comparisons, Jaccard approaches give a more "honest" estimate of genome similarity (taking into account overall composition), but may underestimate the relatedness of regions that are homologous.

ctb commented 3 years ago

I agree! (I'm not sure how to answer the question "what does the megahit assembly look like" tho :)

This is just more of the "ANI is about shared regions, Jaccard measures highlights differences outside of shared regions too" concept.

I'm going to do this on a bunch more genomes and look into the results more systematically once I have bigger n; I have all the raw results, just need to do the preprocessing!

ctb commented 3 years ago

Here's an interesting case --

FastANI output:

genbank-GCA_001508995.1.fna.gz          genbank-GCA_001508995.1.fna.gz          100       685    694
genbank-GCA_001508995.1.fna.gz          megahit-GCA_001508995.1.megahit.fa.gz   99.945    677    694
genbank-GCA_001508995.1.fna.gz          cons-GCA_001508995.1.consensus.fa.gz    96.6229   645    694

sourmash compare output:

0-cons-GCA_001508...    [1.    0.392 0.416]
1-genbank-GCA_001...    [0.392 1.    0.434]
2-megahit-GCA_001...    [0.416 0.434 1.   ]

- both the mapping-based consensus genome and the sgc genome are more similar to each other (by a lot) than they are to the genbank genome!

Note also that we searched ALL of genbank, so there is no better genome out there than that one - and we built two that much more closely match to the SRR1976948 metagenome!

ctb commented 3 years ago

I have dozens of these now. I wonder what the best way to summarize is 🤔 .

We have triples of genomes - genbank, megahit, and consensus.
we could graph pairwise correlations between ANI.
...not sure what to do with the sourmash compare output...

I mean, it's kinda clear that ANI and Jaccard similarities are poor proxies for each other...

I guess a different thing we really want is a summary table for metagenome-derived genome ANIs vs genbank. So, for each metagenome, we'd produce a table:

best genbank genome accession + name
metagenome-derived genome ANI to that genbank genome
metagenome-derived genome Jaccard similarity to that genbank genome

and maybe that would be a valuable report for people considering whether or not to construct a new metagenome-derived genome?

ctb commented 3 years ago

Hmm, one more thing is, how much of the genbank/consensus genome is covered by the reads here? (100% of the megahit genome is covered by the reads)

ctb commented 3 years ago

ok, some specific questions to ask -

can we use the genbank <-> consensus ANI as predictor for the (more correct, but more expensive) genbank <-> sgc ANI?
can we use the genbank <-> consensus Jaccard as predictor for the (more correct, but still somewhat expensive) genbank <-> sgc Jacccard?

(and then we need to get into "what is ANI useful for?")

ctb commented 3 years ago

hu-s1-ani

huh. not what I was expecting! misc notes --

mapping seems to only work out to ~93% ANI, which makes sense (that there's a limit, at least)
spacegraphcats+assembly gets us out to an ANI of ~82-85%, which is interesting!
little to no correlation? that seems odd!
it seems interesting to me that there are situations where there is a high sgc+megahit ANI, but not a high mapping ANI.
I suspect high mapping ANI with low genbank/sgc ANI is where a lot of the genome is not present in the metagenome. I should look at genome coverage or something.

dunno if this is useful. things to ruminate on.

taylorreiter commented 3 years ago

May not be relevant and I may be misinterpreting but -- some of Hu SB1 was in Hu SB2. Best match in genbank could originate from SB2 and not SB1 which could cause weird results? Seems weird because gather should already pull the best results...just thinking out loud, and as I do so this comment is making less sense yet i'm going to post it here anyway :)

ctb commented 3 years ago

I was thinking about the same thing and came to the same realization (that it shouldn't matter). However, it is definitely important to realize that some of the reference genomes were built from hu-s1, and some weren't! That some were taken directly from hu-s1 is why we have a lot of ANI~100% in there, I think!

widdowquinn commented 3 years ago

I agree! (I'm not sure how to answer the question "what does the megahit assembly look like" tho :)

Fair point… ;)

I think I mean in terms of genome completeness, CDS accuracy/completeness, that sort of thing. I'm not very familiar with output from MEGAHIT.

[edit] sorry - I'm being a bit slow… I just read properly and realised you're using a community read dataset as input; I had misunderstood. Apologies.

widdowquinn commented 3 years ago

Hmm, one more thing is, how much of the genbank/consensus genome is covered by the reads here? (100% of the megahit genome is covered by the reads)

I agree (now I've better internalised what's going on). I think we may be able to make a case for a threshold coverage (or something that expresses relative confidence) where we have previously observed high ANI %ID, but very low %coverage between distantly-related complete genomes.

ctb commented 3 years ago

I agree! (I'm not sure how to answer the question "what does the megahit assembly look like" tho :)

Fair point… ;)

I think I mean in terms of genome completeness, CDS accuracy/completeness, that sort of thing. I'm not very familiar with output from MEGAHIT.

THAT's what a megahit assembly looks like 😆 I haven't measured any of that stuff. Easy enough to do, but MAGs are usually a mess so 🤷 I had some more thoughts about how I need to be looking at ANI with only the covered bits of the genome. Tricky-ish to do. Will think about how to do it best.

dib-lab / genome-grist

computing ANI for metagenomes vs ref genomes #18

FastANI

sourmash compare