Closed JappyPing closed 1 year ago
Hi, Pengyao,
It is very likely that your dataset contains numerous sequences with very low abundance. If so, the first step of DAUMI will unfortunately take a long time. In that case, run_AmpliCI cluster might not be the best option for clustering very short sequences (like UMIs) in large-scale data.
I think the algorithm is still running. It would throw an error if there were a problem in the code. If it has still not finished by the time you read this answer, I would expect more than 10k unique molecules. You can wait for it to finish, or I would suggest one of the following solutions:
- If your 1M sequences are combined from several samples, I would suggest running the algorithm per sample.
- One quick solution is to raise the minimum abundance threshold using --abundance. It is set to 2 by default, which means the algorithm screens every unique sequence with abundance >= 2. The risk is that UMIs/haplotypes with very low abundance will be missed.
- The first two steps of DAUMI (see DAUMI's instructions) only obtain the candidate UMI sequences and haplotypes for the third step, which is the core algorithm. You may use other software to produce a FASTA file with candidate UMIs and haplotypes, for example UMI-tools for clustering UMIs.
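To get a feel for how much a higher --abundance threshold would reduce the work, it may help to count how many unique sequences pass each threshold before rerunning. The following is a minimal sketch, not part of AmpliCI or DAUMI: it assumes an uncompressed FASTQ with four lines per record, compares sequences as exact strings, and the function names and the threshold values are only illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def count_unique_sequences(fastq_lines: Iterable[str]) -> Counter:
    """Count how often each read sequence occurs (4-line FASTQ records assumed)."""
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record holds the sequence
            counts[line.strip()] += 1
    return counts


def candidates_per_threshold(
    counts: Counter, thresholds: Sequence[int] = (2, 3, 5, 10)
) -> Dict[int, int]:
    """For each threshold, count the unique sequences with abundance >= threshold."""
    return {t: sum(1 for c in counts.values() if c >= t) for t in thresholds}


def summarize_fastq(path: str) -> None:
    """Print the candidate counts for a FASTQ file (the path is supplied by the caller)."""
    with open(path) as handle:
        counts = count_unique_sequences(handle)
    print("unique sequences:", len(counts))
    for t, n in candidates_per_threshold(counts).items():
        print(f"abundance >= {t}: {n} unique sequences")
```

Calling `summarize_fastq` on the input file shows how quickly the number of screened sequences drops as the threshold increases, which can guide the trade-off between run time and the risk of missing low-abundance UMIs/haplotypes.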
Hope it helps!
Thanks, Xiyu
Hi Xiyu,
I see. Thanks so much for the reply.
Have a good day.
Best regards,
Pengyao
Hi Xiyu,
I am trying to run DAUMI on an amplicon sequencing dataset with 1M reads for deduplication.
The first step, run_AmpliCI cluster, has been running for almost ten hours without producing any output, and it is still running. Is this normal?
INFO [/projects/BIOinfo/Jappy/review/methods/deduplication/AmpliCI/src/options.c::parse_options(158)]: Command: cluster
INFO [/projects/BIOinfo/Jappy/review/methods/deduplication/AmpliCI/src/options.c::parse_options(369)]: Cluster UMIs ....
INFO [/projects/BIOinfo/Jappy/review/methods/deduplication/AmpliCI/src/options.c::parse_options(245)]: Verbosity set to 1.
Thanks for your attention.
Best regards,
Pengyao