frederic-mahe / mumu

C++ implementation of lulu, a R package for post-clustering curation of metabarcoding data
GNU General Public License v3.0
7 stars 0 forks source link

Similarity threshold for 16S V4 amplicon data? #5

Closed eperezv closed 7 months ago

eperezv commented 11 months ago

Hello,

Thank you for coding this nice software. I was wondering whether there is any recommendation or previous tests that indicate an optimal or recommended threshold to work with mumu for 16S amplicon datasets. I know that for ITS, the authors of LULU recommended a minimum similarity of 84% and coverage of 80%, but I was wondering whether those thresholds are ok for 16S amplicon datasets.

Thank you.

frederic-mahe commented 11 months ago

Thanks @eperezv

That's a good question but I don't have an answer. As far as I know, no one as ever reported an assessment of optimal lulu/mumu parameters for different markers. In my own work with 16S and 18S data, I've always used mumu with default parameters (I don't use mumu systematically, only when there is a focus on alpha-diversity in downstream analyses).

If I were to test different thresholds, I would look at the number of unique taxonomic assignments before and after mumu, and the number of remaining clusters. Optimal threshold values should preserve as much taxonomic diversity as possible, while substantially reducing the number of clusters. As you can see, what you define as optimal parameters might depend on what you value most: low numbers of clusters or capacity to detect taxa.

eperezv commented 11 months ago

Thank you for your answer! It is a very good point to evaluate the number of unique taxonomic assignments and how many are lost depending on the threshold.

mimouschka commented 6 months ago

Hi @frederic-mahe I did some comparisons of lulu across a few markers: see https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13398 Thus my question: is there a way to adjust the min ratio in MUMU ? this is the crucial parameter when working with markers other than ITS (much more than min co-occurence in my experience)

frederic-mahe commented 6 months ago

Hi @mimouschka mumu offers the same options as lulu, so yes, there is an option controlling the minimum ratio:

-c, --minimum_ratio positive float minimum or average abundance ratio observed between a potential parent OTU and a query OTU (method is controlled by the option --minimum_ratio_type). For each sample where both the potential parent OTU and the query OTU are present, the local abundance of the potential parent OTU is divided by the local abundance of the query OTU. If the minimum or average observed value is smaller or equal to the 'minimum_ratio' value, then the potential parent OTU is rejected. Any positive value greater than zero is accepted, and the default value is 1.0.

this is the crucial parameter when working with markers other than ITS (much more than min co-occurence in my experience)

I think that the min co-occurence mechanism in lulu has a bug (see https://github.com/tobiasgf/lulu/issues/8), so changing the min co-occurence should have a limited effect on lulu's results.

frederic-mahe commented 6 months ago

Following up on your question (https://github.com/tobiasgf/lulu/issues/20), the min ratio is the abundance of the potential parent divided by the abundance of the child. The default value is 1.0, meaning that the child can be almost as abundant as the potential parent.

A ratio of 1001/1000 passes the filter. A ratio of 1000/1000 does not. You generally want to set min ratio to values >= 1.0. A min ratio smaller than 1.0 would mean that the child can be more abundant than the potential parent, which defeats the purpose of the whole approach.

frederic-mahe commented 6 months ago

lulu's documentation gives an example where minimum_ratio = 10. As I understand it, it means that the potential parents must be more than 10 times more abundant than the child.

mimouschka commented 6 months ago

Hi @mimouschka mumu offers the same options as lulu, so yes, there is an option controlling the minimum ratio:

-c, --minimum_ratio positive float minimum or average abundance ratio observed between a potential parent OTU and a query OTU (method is controlled by the option --minimum_ratio_type). For each sample where both the potential parent OTU and the query OTU are present, the local abundance of the potential parent OTU is divided by the local abundance of the query OTU. If the minimum or average observed value is smaller or equal to the 'minimum_ratio' value, then the potential parent OTU is rejected. Any positive value greater than zero is accepted, and the default value is 1.0.

this is the crucial parameter when working with markers other than ITS (much more than min co-occurence in my experience)

I think that the min co-occurence mechanism in lulu has a bug (see tobiasgf/lulu#8), so changing the min co-occurence should have a limited effect on lulu's results.

Thanks for your explanation @frederic-mahe ! I'll try it out and let you know!

mimouschka commented 6 months ago

Following up on your question (tobiasgf/lulu#20), the min ratio is the abundance of the potential parent divided by the abundance of the child. The default value is 1.0, meaning that the child can be almost as abundant as the potential parent.

A ratio of 1001/1000 passes the filter. A ratio of 1000/1000 does not. You generally want to set min ratio to values >= 1.0. A min ratio smaller than 1.0 would mean that the child can be more abundant than the potential parent, which defeats the purpose of the whole approach.

Thanks for your explanation, reading the methods of the original lulu paper and its GitHub readme made me understand that the ratio was abundance of child / potential parent. I agree that the child abundance should be lower than the parent's, and I think that allowing it to be up to as abundant as the parent's (ratio=1) can lead to the loss of OTUs for closely related co-occuring species

frederic-mahe commented 6 months ago

reading the methods of the original lulu paper and its GitHub readme made me understand that the ratio was abundance of child / potential parent.

Yes, documentation is unclear. The best way to know for sure is to make a test:

require(dplyr)
require(lulu)

## create an example with an abundance ratio > 10
matchlist <- data.frame(x = "B" , y = "A", z = 99.0)
otutable <- data.frame(
  row.names = c("A", "B"),
  s1 = c(12, 1),
  s2 = c(12, 1))

## - min ratio threshold = 0.1
## - if ratio is Child/Parent, then B is the parent
##   (A is merged with B)
lulu::lulu(otutable, matchlist, minimum_ratio = 0.1)$curated_table

## no merging

## Alternative hypothesis:
## - min ratio threshold = 10
## - if ratio is Parent/Child, then A is the parent
##   (B is merged with A)
lulu::lulu(otutable, matchlist, minimum_ratio = 10)$curated_table

## merging

Ratio seems to be Parent/Child in lulu (same thing in mumu).

In practice, I think lulu expects OTUs to be sorted by decreasing abundance and will not consider as potential parents OTUs that are lower in the list:

## B is more abundant than A, but appears after A
otutable <- data.frame(
  row.names = c("A", "B"),
  s1 = c(1, 12),
  s2 = c(1, 12))

## hypothesis: OTU order does not matter
lulu::lulu(otutable, matchlist, minimum_ratio = 0.1)$curated_table
lulu::lulu(otutable, matchlist, minimum_ratio = 10)$curated_table

## no merging

For lulu, order seems to matter, min ratios below 1.0 seem to be ignored.

mimouschka commented 6 months ago

Hi @frederic-mahe thanks for clarifying, very helpful indeed. so:

frederic-mahe commented 6 months ago

min ratio is min abundance ratio Parent/Child and can be >= 1 in both lulu and mumu (1: child can be up to as abundant as parent, 10: parent must be 10x more abundant than child for the child to be merged to the parent, etc...)

Yes, that is correct.

lulu expects OTUs to be sorted by decreasing abundance in the OTU table. This is not the case for MUMU (?).

mumu sorts input data. So, mumu's results are not dependent on input order.