Closed isaacovercast closed 4 years ago
Hmm, I know that in step3 we limit the number of things that will be aligned (e.g., only the top 200 sorted by frequency) and discard singletons at the end of the list, for speed.
In step6 we should not align at all if the number of consens seqs is greater than the number of samples.
I'll take a stab at it right now.
This was finished a while ago. It no longer tries to align clusters in step6 that contain duplicate labels. They are still kept until step7 (i.e., saved in clust_database) but labeled to be filtered.
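For reference, a minimal sketch of what that duplicate-label check could look like. This assumes consensus seq names carry the sample name as everything before the last underscore (a hypothetical naming format; `has_duplicate_samples` is not ipyrad's actual API):

```python
from collections import Counter

def has_duplicate_samples(names):
    """Return True if any sample appears more than once in this cluster.

    Assumes each consensus seq name looks like '<sample>_<index>', with
    the sample name being everything before the last underscore
    (hypothetical format, not necessarily ipyrad's).
    """
    samples = [name.rsplit("_", 1)[0] for name in names]
    counts = Counter(samples)
    return any(n > 1 for n in counts.values())

# a cluster where sample '1A' contributed two consensus seqs:
print(has_duplicate_samples(["1A_0", "1B_3", "1A_7"]))  # True
print(has_duplicate_samples(["1A_0", "1B_3", "1C_7"]))  # False
```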
Sometimes this happens:
It'll just sit there and seem to be stuck aligning clusters for a very, very long time. This sometimes happens when you get a ton of paralogous consensus sequences per sample clustering together into massive super loci:
Now it's trying to align a locus with 40k sequences, which explains why it's taking a while ;p. I've seen this a couple times before and we should really handle this better; it's just wasting time doing these alignments if there are duplicate sample names in a given locus, because these will be thrown out at step 7 anyway. Gotta be something smarter to do here....
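One way to avoid burning time on these superloci would be to gate the aligner before calling it: if a cluster has more consensus seqs than there are samples, at least one sample must be duplicated, so it's doomed at step 7 anyway. A rough sketch (the function name, the `nsamples` argument, and the name format are all hypothetical, not ipyrad's actual API):

```python
def should_align(names, nsamples):
    """Decide whether a cluster is worth aligning in step6.

    Skip alignment when the cluster can't survive step 7 filtering:
    more consensus seqs than samples guarantees a duplicated sample,
    and any repeated sample prefix means paralogs. Assumes seq names
    are '<sample>_<index>' (hypothetical format).
    """
    if len(names) > nsamples:
        # pigeonhole: some sample contributed >1 consensus seq
        return False
    # exact check: any repeated sample prefix means duplicates
    samples = [n.rsplit("_", 1)[0] for n in names]
    return len(samples) == len(set(samples))

# a 40k-seq "superlocus" from a 20-sample assembly is skipped outright:
print(should_align(["s%d_0" % i for i in range(40000)], 20))  # False
# a clean 3-sample locus still gets aligned:
print(should_align(["1A_0", "1B_3", "1C_7"], 20))  # True
```

The cheap length check short-circuits the 40k-sequence case without even parsing names, and the prefix check catches smaller loci that happen to contain duplicates.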