Clustering long-read 18S amplicons

pjmramond commented 10 months ago

Hello there, and thanks a lot for the work on this package!

I am trying to cluster 34,937,058 sequences of about 1,000 bp (18S amplicons) contained in a single FASTA file. I'm using the following command on an HPC cluster:

meshclust \
  -d /export/lv6/projects/NIOZ320/Analysis/3.1_Ecological_Analysis/18S_NIOZ320_NIOZ326.fa \
  -o /export/lv6/projects/NIOZ320/Analysis/3.1_Ecological_Analysis/consensus_95/18S_NIOZ320_NIOZ326_cl_0.95.txt \
  -t 0.95 \
  -b 45000 \
  -v 180000

The code has been running for 125 days and was about to finish its 4th run, which I thought would be the last, but a 5th clustering run of the data has started (see screenshot). This last run has indicated from the beginning that there are "0 unprocessed sequences", and the number of centers found has been stagnating around 47,900 for quite some time.

I understand that this is a lot of data and that the error rate of Oxford Nanopore reads probably adds complexity for the clustering algorithm. The amplicons have nevertheless been quality filtered and represent consensuses of several amplicons (pre-clustered based on Unique Molecular Identifiers). A previous MeShClust run with a similar approach but on 16S data took ~80 days to cluster 33,306,880 amplicons and found 55,715 centers.

My questions are:

1) Am I doing something wrong here? Can MeShClust support such a computation? ("swarm -d 3" ran faster but clustered only 500K reads.)

2) Is there a way to stop the run at this stage and get the current output (centers and their composition)? Is there a way to predict how many runs MeShClust will need before it gives an output?

Any help would be highly appreciated! Best, Pierre

[Screenshots taken 2024-01-18 at 16:12:14, 16:30:44, and 16:30:59]

pjmramond commented 9 months ago

and now starting run 6...

hani-girgis commented 9 months ago

Hi, Pierre.

Thanks for your interest in MeShClust v3.0.

No, you are not doing anything wrong. MeShClust v3 should take longer than MeShClust v1 because of the all-vs-all comparison done at the beginning and again whenever enough sequences have accumulated in the reservoir.

The current version does NOT log the results (this is a good feature to include in the next release God willing).

The -p parameter controls the number of data passes (default: 10). The algorithm may converge before the 10th pass if the number of clusters does not change during a data pass. The good news is that the algorithm should run faster in late passes than in early ones because it may not need to do as many of the all-vs-all blocks.

If you would like to speed up the algorithm in the future, you may want to reduce the size of the all-vs-all block (-b) and increase the size of the batch (-v).
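As a rough sketch of that suggestion, the snippet below assembles a variant of the original command with a smaller -b, a larger -v, and an explicit -p, then prints it as a dry run before anything is submitted. The specific values (20000, 360000, 5) are arbitrary illustrations, not tuned recommendations, and the file names are shortened placeholders for the full paths. Lowering -p below its default presumably stops the run after that many passes even if the cluster count is still changing, so treat it as a trade-off rather than a free speed-up.

```shell
# Placeholder file names; substitute the full paths used on the cluster.
input="18S_NIOZ320_NIOZ326.fa"
output="18S_NIOZ320_NIOZ326_cl_0.95.txt"

# Illustrative values only (assumptions, not benchmarked settings):
block_size=20000    # -b: all-vs-all block size, reduced from 45000
batch_size=360000   # -v: batch size, increased from 180000
passes=5            # -p: number of data passes (default: 10)

cmd="meshclust -d $input -o $output -t 0.95 -b $block_size -v $batch_size -p $passes"

# Dry run: print the command so it can be checked before launching.
echo "$cmd"

# Uncomment to launch (relies on the paths containing no spaces):
# $cmd
```

Printing the command first is a cheap check on a job that may run for weeks; once the flags look right, the last line can be uncommented or the printed command pasted into the job script.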

Please keep me posted.

Hani Z. Girgis, PhD

hani-girgis commented 9 months ago

Hello again, Pierre.

I am working on the next version of MeShClust. It would be very helpful to use your data while developing and testing it. Are the 16S or the 18S data already published? If yes, where can I download the data set(s)? Feel free to email me at hzgirgis at buffalo dot edu.

Best regards.

Hani Z. Girgis, PhD

BioinformaticsToolsmith / Identity

Clustering long-read 18S amplicons #21