davidesangui / metapresence

Metapresence is a Python tool that calculates metrics assessing the evenness of the coverage distribution over DNA sequences, in order to infer their presence in a given sample.

How to use metapresence efficiently for environmental metagenomes? #1

Closed: bhagavadgitadu22 closed this issue 4 days ago

bhagavadgitadu22 commented 3 weeks ago

I have 200 metagenomes extracted from stream biofilms. I binned contigs to obtain MAGs and dereplicated the MAGs obtained across all samples to get a final pool of MAGs. Finally, I mapped the reads from each sample onto this MAG pool to obtain abundances with coverM, and I removed false-positive presences using metapresence. I have two questions regarding the use of metapresence, though:

1) Short-read mappers like minimap2 do not normally allow secondary alignments: that means that if a read maps equally well to two MAGs, it will be randomly allocated to one of them. This should not affect the overall abundance much, but since metapresence looks at the distance between consecutive reads, I suspect it might be more impacted by this random allocation. Is this something I should be concerned about?

2) I used only the MAGs I assembled from this dataset, because the usual calculation of abundances involves many false-positive presences. Since metapresence removes these false positives, could I add to my pool of MAGs all the MAGs available online for this kind of environment?

davidesangui commented 2 weeks ago

Hi!

  1. If the allocation is actually random and not biased, this should not be a problem, since the read-distance metric essentially measures randomness (how randomly the reads map); adding more randomness should therefore not affect its value (see the simulation sketch after this list). I used Bowtie2 for all my tests, and it performs this random allocation of multimappers (see here); the results were good.

  2. Yes, absolutely! This is the main reason why metapresence was developed: being able to use the information available in genomic databases without the problem of false positives. Going back to multimapping reads, you should probably dereplicate your MAGs together with those available online before alignment, as in the dereplication sketch below.
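To make point 1 concrete, here is a minimal simulation sketch (an illustration of mine, not part of metapresence): uniformly distributed read positions are randomly thinned by 50% to mimic the random allocation of multimappers between two near-identical genomes, and the coefficient of variation of the consecutive-read distances barely changes.

```python
import numpy as np

rng = np.random.default_rng(42)
genome_len = 5_000_000
n_reads = 10_000

# Uniformly distributed read start positions on a "present" genome.
positions = np.sort(rng.integers(0, genome_len, n_reads))

def distance_cv(pos):
    """Coefficient of variation of the distances between consecutive reads."""
    d = np.diff(pos)
    return d.std() / d.mean()

# Randomly keep half of the reads, as if each multimapper were assigned
# to one of two identical genomes with probability 0.5.
kept = np.sort(rng.choice(positions, n_reads // 2, replace=False))

print(f"CV before thinning: {distance_cv(positions):.3f}")
print(f"CV after thinning:  {distance_cv(kept):.3f}")
# Both values stay close to 1 (Poisson-like spacing): random thinning
# does not make the read distances any less random.
```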
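For point 2, a hypothetical pre-alignment dereplication step. This assumes dRep is installed and on PATH and that the -sa flag (secondary ANI threshold) matches your dRep version; the directory names are made up.

```python
import glob
import subprocess

# Pool your own MAGs with the ones downloaded from public databases
# (hypothetical directory names).
genomes = glob.glob("my_mags/*.fa") + glob.glob("public_mags/*.fa")

# Cluster at 95% ANI and keep one representative per cluster, so that
# near-identical genomes do not split multimapping reads at alignment time.
subprocess.run(
    ["dRep", "dereplicate", "derep_out", "-g", *genomes, "-sa", "0.95"],
    check=True,
)
```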

Let me know if you have any further questions.

Davide

bhagavadgitadu22 commented 2 weeks ago

Thanks for the answers! Follow-up questions:

davidesangui commented 2 weeks ago

Hi,

  1. Interesting question! I ran some tests with microbial MAGs while developing the tool. Imagine that you have a genome with 10,000 reads mapping on it. The reads map uniformly (i.e. randomly across the whole genome) and metapresence gives you almost perfect metric values (BER = 0.99, FUG = 0.62), meaning that those reads were generated from a highly similar genome. If you added another, almost identical genome and mapped the reads against both genomes together, most of the reads would be multimappers and, because of what we said above about short-read aligners and multimappers, you would end up with around 5,000 mapped reads per genome. Still, the reads would map uniformly on both of them: the metric values of the first genome would be essentially unchanged, and those of the second genome would be pretty much the same as the first's (since they are almost identical). Of course, the more the second genome differs from the first (and from the genome the reads were generated from), the lower its metric values will be.

     Going back to your specific case: if you have a cluster of two MAGs, one assembled directly from your samples and one from the viral database, then your own MAG will most probably have higher metric values, for example because of accessory regions in the MAG from the database, as you correctly pointed out. The difference between presence and absence depends on the metric thresholds that you use. The default ones (BER 0.8, FUG 0.5) are tailored to an ANI threshold of 95%. You may use higher metric thresholds (for example BER 0.9 and FUG 0.55) to achieve higher resolution (I plan to do a more rigorous analysis of the relationship between coverage evenness and global measures of sequence similarity). You may also decide not to dereplicate before alignment and metapresence, but instead to dereplicate only the MAGs identified as present (and, after that, to re-scale the relative abundance values; see the filtering sketch after this list).

     I apologize for the long answer; your question raises many things to consider and many possibilities. I hope I have answered it (and thank you for bringing it up).

  2. I have no suggestions for reference databases to use, sorry. Regarding your example, I think that such a number of viral MAGs would amount to around 10 Gb of sequence (I just looked up the average viral genome size): for read alignment you would need a bit of patience (especially when building the index; see the index-building sketch below), but I think it is still feasible. For metapresence, it is more a matter of the number of reads than of the reference size (at least at these reference sizes).
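For the thresholds and re-scaling in point 1, a minimal filtering sketch. The file name and the column names ("genome", "BER", "FUG", "relative_abundance") are assumptions about the metrics table; adapt them to the actual output of your metapresence version.

```python
import pandas as pd

# Assumed tab-separated metrics table written by metapresence.
df = pd.read_csv("metapresence_metrics.tsv", sep="\t")

# Stricter thresholds than the defaults (BER 0.8, FUG 0.5), for higher
# resolution between near-identical genomes.
present = df[(df["BER"] >= 0.9) & (df["FUG"] >= 0.55)].copy()

# After dropping absent genomes (or dereplicating the present ones),
# re-scale the relative abundances so that they sum to 100% again.
present["relative_abundance"] *= 100 / present["relative_abundance"].sum()

present.to_csv("present_mags.tsv", sep="\t", index=False)
```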
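For point 2, a sketch of the index-building step, assuming Bowtie2 as in my tests; bowtie2-build switches to a large index automatically for references above ~4 Gb, --large-index merely forces it, and --threads speeds up the build. File names are hypothetical.

```python
import subprocess

# Building the index is the step that needs patience with a ~10 Gb reference.
subprocess.run(
    ["bowtie2-build", "--threads", "16", "--large-index",
     "all_viral_mags.fasta", "viral_mags_index"],
    check=True,
)
```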

Davide

bhagavadgitadu22 commented 2 weeks ago

Thank you very much for the detailed answers.

I still have a question regarding the inclusion of public MAGs in my dataset. If you have a huge reference dataset, many MAGs are going to "steal" reads because they share a genomic region with a present MAG. As a result, these present MAGs will have "gaps" in their coverage and, in my understanding, they will likely be classified as absent by metapresence.

Basically, I am worried that the ratio (number of presences / number of reference MAGs) decreases as the number of reference MAGs increases. This bias might have an important impact when you look at all the publicly available MAGs. It will definitely not increase false positives, but it might increase false negatives a lot.

To remove the influence of the reference dataset size, I could try to find a short-read mapper that allows multimapping, or I could work sequentially:

What do you think?

davidesangui commented 2 weeks ago

Hi,

I completely agree with you; what you are describing is indeed a potential source of error (false negatives) for the approach. A "symmetrical" situation is also discussed in the article, where reads mis-map from high-coverage genomes onto some regions of low-coverage ones, resulting in anomalous "peaks" in the coverage profiles of the latter.

Allowing multimappers, as you suggest, would indeed address the problem you mention, but not its symmetrical counterpart above, for which it would be better to filter out all multimapping reads (something that, in my opinion, would hinder the identification of low-coverage genomes).

The thing is that any unevenness of coverage lowers the value of the metrics. Still (without going into details), the coverage of absent genomes should be much more "uneven" than in the cases we are discussing; the toy sketch below illustrates how a coverage gap lowers a BER-like value. Thus, I think that the sequential solution you propose is a clever way to overcome this limitation, at least on the side of the "gaps" rather than that of the "peaks". Thank you for your interesting and useful comments!
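To illustrate the "gaps" side of the discussion, a toy sketch (my own approximation, not metapresence's actual code): a BER-like value, computed here as the observed breadth divided by the breadth expected under random coverage (1 - e^-C, as in Lander-Waterman statistics), drops when the reads of a shared region are "stolen" by another reference.

```python
import numpy as np

rng = np.random.default_rng(0)
genome_len, read_len, n_reads = 1_000_000, 150, 10_000

def ber_like(starts):
    """Observed breadth divided by the breadth expected for random coverage."""
    covered = np.zeros(genome_len, dtype=bool)
    for s in starts:
        covered[s:s + read_len] = True
    breadth = covered.mean()
    coverage = len(starts) * read_len / genome_len
    return breadth / (1 - np.exp(-coverage))

uniform = rng.integers(0, genome_len - read_len, n_reads)

# Simulate a shared region whose reads were "stolen" by another reference:
# drop every read falling in a 20% slice of the genome.
gap_start, gap_end = 400_000, 600_000
with_gap = uniform[(uniform < gap_start) | (uniform >= gap_end)]

print(f"even coverage: {ber_like(uniform):.2f}")   # close to 1
print(f"with a gap:    {ber_like(with_gap):.2f}")  # noticeably lower
```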

Davide