adriantich / DnoisE

Distance denoise by Entropy
GNU General Public License v3.0
11 stars 2 forks source link

How to deal with many singletons in the output? #31

Closed timz0605 closed 1 month ago

timz0605 commented 1 month ago

Hello!

I am currently working on a COI metabarcoding project of marine samples and using the Leray COI primer for my analyses. Therefore, many examples shown here and in the paper apply to my study.

After running the DnoisE program, I ended up with approx. 520,000 sequences. Starting with the 150,000th sequence, everything below is Singleton. While I was using the unoise algorithm, I ended up with much fewer sequences. This is likely because unoise has a default but also customizable --minsize option. For DnoisE, I was wondering if there is a similar option?

On the other hand, I was wondering if you have any suggestions for dealing with this situation? Namely, how should I decide a threshold where I will discard all the sequences that do not meet a certain "size" (which stands for frequency) in the final output?

Below are some codes I used as reference. I would greatly appreciate any help! Thank you!

## Dereplication
vsearch --derep_fulllength combined.fasta --sizeout --relabel uniq. --output uniques.fasta
## Denoising 
## This is the command I use in DnoisE
python3 ~/DnoisE/dnoise/DnoisE.py --fasta_input uniques.fasta --fasta_output DnoisE.fasta -y -c 8 -m 313 -u
## Also providing the command I use in unoise as a reference
vsearch --cluster_unoise uniques.fasta --sizein --minsize 72 --sizeout --relabel denoised. --centroids denoised.fasta
(--minsize is calculated by total number of sequences concatenated * 0.001%, and then round up or round down)
adriantich commented 1 month ago

Hello, yes, in DnoisE there is the -r --min_abund option that will remove all the sequence below this threshold. About which threshold use... this is a more difficult question that each pipeline should address and many scenarios can be taken into account.

In our pipeline MJOLNIR3 (soon named ASGARD) we apply different minimum abundance thresholds at different steps. ie. If you want to run SWARM after DnoisE but keeping the maximum number of sequences to be able to create more connected swarms and apply the filter of minimum abundance after, it is til a good strategy to remove the singletons to improve drastically the running performance of SWARM.

However, as final filtering step, remove low abundance sequences will be useful to remove those spurious sequences that have remained during the pipeline. But the number of samples, the marker used, the type of sample substrate, the target species and the ecological questions, among others, have to be considered to be more or less restrictive.

In Summary: First question: easy answer --> YES there is an option Second question: I believe that there is not a "correct" answer other than "It depends"

Hope you find it useful.

Adrià

timz0605 commented 1 month ago

Hello Adrià,

Thank you so much for the swift response!

Following up with your insights, one way I am setting the -r --min_abund option is simply following the Bokulich et al., 2013 paper (https://www.nature.com/articles/nmeth.2276), which identified a minimum abundance threshold to be about 0.001% of the total sequence number. E.g., in my example code, I have approx. 7,200,000 sequences, so I set --minsize 72 (this 0.001% threshold is not written directly in the paper, but something my collaborator has suggested). If I implement this number during the denoising step with DnoisE, the final sequence count is approx. 6000, which makes more sense for downstream analyses compared to 520,000.

Going on with that, I was wondering:

1/ Is there a default calculation method in DnoisE, or if there is a default number set? (like --minsize 8 in unoise)

2/ If no, could you tell me what your approach was when analyzing COI data? Do you filter the abundance later, do you use the --min_abund option, or you use some other approaches?

I am not using SWARM after the denoising step. I simply remove the chimera, and then generate the frequency table for downstream analyses. Therefore, I would appreciate any input from you, as I am still relatively new to this field.

Thank you!

adriantich commented 1 month ago

the default number is 1. The other question I need some time to check. I don't want to give you an answer without checking but you can take a look to MJOLNIR3, my answer would be based on it (https://github.com/adriantich/MJOLNIR3)

timz0605 commented 4 weeks ago

Hello Adrià!

This is just to follow up on our previous discussions. I was wondering if you've had some time to look at how to choose a recommended --min_abund option? Thank you!