dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Handle massively clustered paralogs in step 6 #391

Closed isaacovercast closed 4 years ago

isaacovercast commented 4 years ago

Sometimes this happens:

-------------------------------------------------------------
  ipyrad [v.0.9.31]
  Interactive assembly and analysis of RAD-seq data
------------------------------------------------------------- 
  Step 6: Clustering/Mapping across samples
  [####################] 100% 0:01:22 | concatenating inputs
  [####################] 100% 1:11:54 | clustering tier 1
  [####################] 100% 0:00:05 | concatenating inputs
  [####################] 100% 0:30:22 | clustering across
  [####################] 100% 0:01:19 | building clusters
  [################### ]  98% 13:50:44 | aligning clusters

Where it'll just sit there, apparently stuck on aligning clusters, for a very long time. This sometimes happens when a ton of paralogous consensus sequences per sample cluster together into massive super-loci:

from collections import Counter

# parse one across-sample cluster chunk file into its individual loci
f = "x.chunk_8455"
chunks = open(f).read().split("\n//\n//")
chunks = [x.split("\n") for x in chunks]

# distribution of locus sizes (lines per locus) across the chunk
c = Counter([len(x) for x in chunks])

# find the first huge locus and tally how many consensus seqs each sample contributed
badc = []
for x in chunks:
    if len(x) > 1000:
        print(len(x))
        print(x[:5])
        cnames = Counter([y.rsplit("_", 1)[0] for y in x if ">" in y])
        badc = x
        break
cnames

39623
['', '>x1_29143', 'TGCAGCTGTTCATGCTCATCGTGCTAATAGAAGAGCAAATAATATGTACCATTATACATCCAGGAGACACCATTCACAAATTCATGAAAATGACATnnnnGGCACTATATAGGCATATCCAYGCNAATGCGATGCATATGCCTATACCATCTTGAAAAAAAAATCTAACGCGAGCGAAGCCGNKGGCAAAAGCTAGTWNKRKWNAW', '>x2_5312', 'TGCAGCGATGGAGGTATATCCACGGCAATRCAATGCATATGCCTATACCATCTTGAAAAAAAATCTAACGCGAGYGAAGCCGCGGGCAAAAGCTAGnnnnGATAAAACAAAAATATTTACATTATTTACATAAAATGAACCAAAACAAAACCACTTATCTGGAGGAATCGGAAGGTGAGAACATCCCTAAACAGAAACGTAAAT']
Counter({'>x1': 365,
         '>x2': 519,
         '>x3': 255,
         '>x4': 343,
         '>x5': 426,

Now it's trying to align a locus with 40k sequences, which explains why it's taking a while ;p. I've seen this a couple of times before, and we really should handle it better: it's just wasting time doing these alignments when there are duplicate sample names in a given locus, because those loci will be thrown out at step 7 anyway. Gotta be something smarter to do here....
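
For example, something along these lines (just a sketch, not actual ipyrad code; the function name is made up) would catch these loci before the aligner ever sees them:

from collections import Counter

def has_duplicate_samples(locus_lines):
    # header lines look like ">samplename_1234"; if any sample name
    # shows up more than once, the locus is a cluster of paralogs
    names = [y.rsplit("_", 1)[0] for y in locus_lines if y.startswith(">")]
    return any(n > 1 for n in Counter(names).values())

# parse the chunk file the same way as the snippet above
chunks = open("x.chunk_8455").read().split("\n//\n//")
chunks = [x.split("\n") for x in chunks]

# only hand clean loci to the aligner; the rest get filtered at step 7 anyway
to_align = [x for x in chunks if not has_duplicate_samples(x)]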

eaton-lab commented 4 years ago

hmm, I know that in step 3 we limit the number of things that will be aligned (e.g., only the top 200 sorted by frequency) and discard singletons at the end of the list for speed.

In step 6 we should not align at all if the number of consens seqs is greater than the number of samples.
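
That check would be cheap, something like this (sketch only; the function name and where it would hook in are hypothetical):

def needs_alignment(locus_lines, n_samples):
    # a locus with more consensus seqs than samples must contain at least
    # one duplicated sample, so the expensive alignment can be skipped
    n_seqs = sum(1 for y in locus_lines if y.startswith(">"))
    return n_seqs <= n_samples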

eaton-lab commented 4 years ago

I'll take a stab at it right now.

dereneaton commented 4 years ago

This was finished a while ago. It no longer tries to align clusters in step 6 that contain duplicate labels. They are still kept until step 7 (i.e., saved in clust_database) but are labeled to be filtered.
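
Roughly, the logic is now (paraphrased sketch only, not the actual code; the names are made up):

def cluster_to_database(locus_lines, aligner):
    # skip alignment when duplicate sample labels are present, but keep the
    # cluster and flag it so the step 7 duplicate filter will drop it
    names = [y.rsplit("_", 1)[0] for y in locus_lines if y.startswith(">")]
    if len(names) != len(set(names)):
        return locus_lines, True        # unaligned, flagged for filtering
    return aligner(locus_lines), False  # aligned and kept as usual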