BioinformaticsToolsmith / Identity

Runtime for Large Dataset (1.3M seqs) #14

Open e-trop opened 1 year ago

e-trop commented 1 year ago

Hi there,

Thanks for making this tool available and for the clean repo!

I am trying to run MeshClust on a set of 1.3 million sequences with lengths ranging from 75 bp to 6,000 bp. From the paper, I saw that you were able to run MeshClust on a microbiome dataset of ~1 million sequences in ~2 hrs with the hardware specified there.

I've run MeshClust on my dataset with a calculated identity threshold. It has been running for 12 hrs, has processed only 160k sequences, and is still on the first data pass. I see that there are still many sequences (~50k) in the reservoir. I'm guessing the long runtime comes from the reservoir being continually refilled, which causes the initialization step for mean shift to be rerun.

I wanted to check whether you had any ideas why it's taking this long, or ways I could split the data for better runtime.

Kind regards, Evan

hani-girgis commented 1 year ago

Hi, Evan.

The length range of the microbiome sequences in the paper is 171–372, which is more homogeneous than yours.

Yes, dividing your data set based on length would work. Then I would cluster each group separately.
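
A minimal Python sketch of the length-based split, for illustration only. The function names `read_fasta` and `bin_by_length` and the bin edges are hypothetical and not part of MeshClust or Identity. Sensible edges depend on the length distribution of the data.

```python
import bisect
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def bin_by_length(records, edges):
    """Group records by sequence length.

    `edges` is an ascending list of inclusive upper bounds (hypothetical
    values chosen by the user). Bin i holds sequences no longer than
    edges[i]; anything longer than the last edge goes into one extra bin.
    """
    bins = defaultdict(list)
    for header, seq in records:
        bins[bisect.bisect_left(edges, len(seq))].append((header, seq))
    return bins


# Example with made-up sequences and made-up edges.
example = [">a", "ACGT", ">b", "AC", "GTAA", ">c", "A" * 10]
example_bins = bin_by_length(read_fasta(example), edges=[4, 8])
```

Each bin can then be written to its own FASTA file and clustered separately.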

After that, you may want to extract the centers, run Identity (all-vs-all) on them, and merge similar centers (those with identity scores greater than the threshold) by selecting one of them.
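
One possible way to do the merge, sketched in Python. It assumes the all-vs-all scores have already been parsed into `(center_a, center_b, score)` tuples; Identity's actual output format is not shown in this thread, so the parsing is left out. The greedy rule (keep a center only if it is not similar to one already kept) is one interpretation of "select one", and `select_representative_centers` is a hypothetical name.

```python
from collections import defaultdict


def select_representative_centers(center_ids, scored_pairs, threshold):
    """Greedily keep centers that are not similar to any kept center.

    Two centers count as similar when their score is strictly greater
    than `threshold` (same scale as the scores themselves).
    """
    similar = defaultdict(set)
    for a, b, score in scored_pairs:
        if a != b and score > threshold:
            similar[a].add(b)
            similar[b].add(a)

    kept, kept_set = [], set()
    for center in center_ids:
        # Drop this center if it is similar to one we already kept.
        if similar[center].isdisjoint(kept_set):
            kept.append(center)
            kept_set.add(center)
    return kept


# Example with made-up identifiers and scores.
reduced = select_representative_centers(
    ["c1", "c2", "c3", "c4"],
    [("c1", "c2", 0.95), ("c2", "c3", 0.92), ("c3", "c4", 0.50)],
    threshold=0.90,
)
```

The result depends on the order of `center_ids`; sorting centers by cluster size first is one option for making the choice less arbitrary.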

Finally, run Identity on the reduced center set against the entire data set and assign each sequence to its closest center.
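
The assignment step could look roughly like this in Python, again assuming the scores are already available as `(sequence_id, center_id, score)` tuples (the parsing of Identity's output is not covered here). `assign_to_closest_center` is a hypothetical helper, and "closest" is taken to mean the highest identity score.

```python
def assign_to_closest_center(scored_pairs):
    """Map each sequence to the (center, score) pair with the highest score.

    Ties keep the first center encountered.
    """
    best = {}
    for seq_id, center_id, score in scored_pairs:
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (center_id, score)
    return best


# Example with made-up identifiers and scores.
assignment = assign_to_closest_center([
    ("s1", "c1", 0.97),
    ("s1", "c3", 0.80),
    ("s2", "c3", 0.91),
    ("s2", "c1", 0.93),
])
```

Sequences that never appear in the score list get no assignment, so they may need to be handled separately, for example as singleton clusters.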

This is a workaround for now, but the process can be automated in future releases.

Let me know if you have additional questions.

Best regards.

Hani