Long time running no outputs

JappyPing commented 1 year ago

Hi Xiyu,

I am trying to run DAUMI on an amplicon sequencing data set with 1M reads for deduplication.

The first step, run_AmpliCI cluster, has been running for almost ten hours without producing any output, and it is still running. Is this normal?

INFO [/projects/BIOinfo/Jappy/review/methods/deduplication/AmpliCI/src/options.c::parse_options(158)]: Command: cluster
INFO [/projects/BIOinfo/Jappy/review/methods/deduplication/AmpliCI/src/options.c::parse_options(369)]: Cluster UMIs ....
INFO [/projects/BIOinfo/Jappy/review/methods/deduplication/AmpliCI/src/options.c::parse_options(245)]: Verbosity set to 1.

Thanks for your attention.

Best regards,

Pengyao

xiyupeng commented 1 year ago

Hi, Pengyao,

It is very likely that your dataset contains numerous sequences with very low abundance. If so, the first step of DAUMI will unfortunately take a long time. In this case, run_AmpliCI cluster might not be the best option for clustering very short sequences (like UMIs) in large-scale data.

I think the algorithm is still running; it would throw an error if there were a problem in the code. If it has still not finished as I write this answer, I would expect more than 10k unique molecules. You can wait for it to finish, or try one of the following solutions:

If your 1M sequences are combined from several samples, I would suggest running the algorithm per sample.
One quick solution is to raise the minimum abundance threshold using --abundance. It is set to 2 by default, which means the algorithm screens every unique sequence with abundance >= 2. The risk is that UMIs/haplotypes with very low abundance may be missed.
The first two steps of DAUMI (see DAUMI's instructions) obtain the candidate UMI sequences and haplotypes for the third step, which is the core algorithm. You may use other software to produce a FASTA file with candidate UMIs and haplotypes, for example UMI-tools for clustering UMIs.
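
To judge how much raising the threshold would help, it may be worth counting how many unique sequences reach each abundance cutoff before rerunning. Below is a minimal Python sketch, assuming an uncompressed FASTQ with exactly four lines per record. The function names are hypothetical and not part of AmpliCI, and the script counts whole read sequences, so it only approximates what the clustering step sees (extract the UMIs first if you want counts for UMIs alone).

```python
from collections import Counter
from itertools import islice


def count_unique_sequences(fastq_lines):
    """Count how often each sequence occurs in 4-line FASTQ records."""
    # The sequence is the 2nd line of every 4-line record (indices 1, 5, 9, ...).
    return Counter(line.strip() for line in islice(fastq_lines, 1, None, 4))


def candidates_at_threshold(counts, min_abundance):
    """Number of unique sequences with abundance >= min_abundance."""
    return sum(1 for n in counts.values() if n >= min_abundance)


def summarize(path, thresholds=(2, 5, 10)):
    """Print how many unique sequences survive each abundance cutoff."""
    with open(path) as handle:
        counts = count_unique_sequences(handle)
    print(f"unique sequences: {len(counts)}")
    for threshold in thresholds:
        kept = candidates_at_threshold(counts, threshold)
        print(f"abundance >= {threshold}: {kept}")
```

Calling `summarize` on your FASTQ file prints one line per cutoff. If the count at 2 is very large and drops sharply at a higher value, that higher value is a candidate for --abundance, with the caveat above about losing low-abundance UMIs/haplotypes.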

Hope it helps!

Thanks, Xiyu

JappyPing commented 1 year ago

Hi Xiyu,

I see. Thanks so much for the reply.

Have a good day.

Best regards,

Pengyao

DormanLab / AmpliCI