Runtime for Large Dataset (1.3M seqs)

BioinformaticsToolsmith / Identity

Hi there,

Thanks for making this tool available and for the clean repo!

I am trying to run MeshClust on a set of 1.3 million sequences with lengths ranging from 75 bp to 6,000 bp. From the paper, I saw that you were able to run MeshClust on a microbiome dataset of ~1 million sequences in ~2 hrs on the hardware specified in the paper.

I've run MeshClust on my dataset with a calculated identity threshold. It has been running for 12 hrs, has only processed 160k sequences, and is still on the first data pass. I see that there are still ~50k seqs in the reservoir. I'm guessing it is taking so long because the reservoir is continually being refilled and the initialization step for mean shift is then rerun each time.

I wanted to check whether you had any ideas why it's taking this long, or whether there are ways I could split the data for better runtime.

Kind regards, Evan

Runtime for Large Dataset (1.3M seqs) #14