dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Step 6 muscle_it() - ValueError: dictionary update sequence element #0 has length 1; 2 is required #493

Closed isaacovercast closed 1 year ago

isaacovercast commented 1 year ago

This was reported by @alexkrohn on gitter.

I spent way too much time trying to figure out what was causing this. TL;DR: if you have PE data and all the R2s are blank, you get a cluster like this:

>DBT331001_304
GGGCTNGGGGGGGGGGTGTCCCGTGGTTAGGGTAGGGAGCCAGGACTCCTGGGTTCTATGGGCCTCTGGGGTGGGGGGTAAGTNGGGTAACAGGCAGCCCCCTCTCCTAGGGCCTGCTTGAANNAGTTTGGATCNNnnnn
>DBT067001_11616
GGCTNGGGGGGGGGGTGTCCCGTGGTTAGGGTAGGGAGCCAGGACTCCTGGGTTCTATGGGCCTCTGGGGTGGGGGGTAAGTGGGGTAACAGGCAGCCCCCTCTCCTAGGGCCTGCTTGAAAGAGTTTGGATCnnnn

After splitting on 'nnnn' and passing the R2 seqs to muscle_it(), muscle throws an error (*** ERROR *** No sequences in input file) because the seqs list contains only empty strings (in this case ['','']). The error message goes to stderr, so Python doesn't see it; the problem cascades down a bit and then surfaces like this (which is a little confusing):

File ~/src/ipyrad/ipyrad/assemble/clustmap_across.py:1417, in muscle_it(proc, names, seqs)
   1415 # reorder b/c muscle doesn't keep order
   1416 lines = "".join(align1)[1:].split("\n>")
-> 1417 dalign1 = dict([i.split("\n", 1) for i in lines])
   1418 keys = sorted(
   1419     dalign1.keys(), 
   1420     key=lambda x: int(x.rsplit("*")[-1])
   1421 )
   1422 seqarr = np.zeros(
   1423     (len(names), len(dalign1[keys[0]].replace("\n", ""))),
   1424     dtype='S1',
   1425 )

ValueError: dictionary update sequence element #0 has length 1; 2 is required
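
For anyone who wants to see the failure outside of ipyrad, here is a minimal sketch in plain Python. It assumes muscle wrote nothing to stdout for the empty R2 input, so the string being parsed is empty. The reads are shortened stand-ins for the ones above, and parse_alignment() is a hypothetical helper for illustration, not ipyrad code.

```python
# Minimal reproduction, assuming muscle produced no stdout for the empty R2s.
seqs = [
    "GGGCTNGGGGTGTCCnnnn",  # shortened stand-ins for the two reads above
    "GGCTNGGGGTGTCCnnnn",
]

# Everything after the 'nnnn' separator is the R2 part: here it is empty.
right = [s.split("nnnn")[1] for s in seqs]
print(right)  # ['', '']

align1 = ""  # assumed: what Python receives when muscle only writes to stderr
lines = "".join(align1)[1:].split("\n>")
print(lines)  # ['']

try:
    # Same expression as in the traceback: each item should split into
    # [name, sequence], but '' splits into a single-element list.
    dict([i.split("\n", 1) for i in lines])
except ValueError as err:
    message = str(err)
    print(message)


def parse_alignment(align1):
    """Hypothetical guard: fail early with a readable message."""
    text = "".join(align1)
    if not text.strip():
        raise ValueError("aligner returned no sequences (empty input?)")
    lines = text[1:].split("\n>")
    return dict(i.split("\n", 1) for i in lines)
```

A guard like this would not fix the underlying edge case, but it would make the failure point at the empty aligner output instead of at dict().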

isaacovercast commented 1 year ago

Here's my solution. If all R2 seqs are empty then clip off the 'nnnn' and treat it as R1 or merged data. I'm testing it now.

Inside ipyrad.assemble.clustmap_across.align_to_array():L1349-1362:

        # else locus looks good, align it.
        # is there a paired-insert in any samples in the locus?
        try:

            # try to split cluster list at nnnn separator for each read
            left = [i.split("nnnn")[0] for i in seqs]
            right = [i.split("nnnn")[1] for i in seqs]

            if not any(right):
                # If _all_ R2 seqs are empty then raise the IndexError
                # and treat it as R1 only. Insane edge case, took one entire
                # day to figure out. iao 9/15/22
                seqs = left
                raise IndexError()
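
The idea behind the patch can be sketched as a standalone function. This is only an illustration: split_pairs() is a hypothetical name, and returning None for the R2 side is an assumption made here for clarity. The snippet above signals the single-end case by raising IndexError inside align_to_array() instead.

```python
def split_pairs(seqs, sep="nnnn"):
    """Split each cluster sequence at the paired-end separator.

    Returns (left, right). right is None when there is no usable R2,
    i.e. a read lacks the separator or every R2 is empty.
    """
    try:
        left = [s.split(sep)[0] for s in seqs]
        right = [s.split(sep)[1] for s in seqs]
    except IndexError:
        # At least one read has no separator: treat the locus as single-end.
        return list(seqs), None

    if not any(right):
        # All R2s are empty strings: drop the separator and keep only R1.
        return left, None

    return left, right


print(split_pairs(["AAnnnn", "CCnnnn"]))    # (['AA', 'CC'], None)
print(split_pairs(["AAnnnnGG", "CCnnnn"]))  # (['AA', 'CC'], ['GG', ''])
```

With this shape, a caller would only align the R2 side when right is not None, so an all-empty list never reaches the aligner.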

isaacovercast commented 1 year ago

Pretty crazy that this has never come up before....

isaacovercast commented 1 year ago

Here is the Jupyter notebook I used for debugging this problem, in case it's ever useful.

Step6-ipynb.md

Juliazhou1994 commented 4 months ago

Hi, I have run into the same problem, but I don't really understand how to use this file to solve it. Do I need to run the file 'Step6-ipynb.md' in my own Jupyter notebook? Does it contain any code? Could you explain a bit more? Thanks a lot!

isaacovercast commented 4 months ago

@Juliazhou1994 Are you sure it's the same problem? Can you run step 6 with the -d flag and post the full output here?

Juliazhou1994 commented 3 months ago

Hi, here is my log file

Parallel connection | nku-PowerEdge-T640: 60 cores
[####################] 100% 1:22:30 | processing reads | s2 |
[####################] 100% 1:02:15 | join merged pairs | s3 |
[####################] 100% 0:52:51 | join unmerged pairs | s3 |
[####################] 100% 0:47:36 | dereplicating | s3 |
[####################] 100% 10 days, 8:23:38 | clustering/mapping | s3 |
[####################] 100% 0:00:43 | building clusters | s3 |
[####################] 100% 0:00:09 | chunking clusters | s3 |
[####################] 100% 18:46:03 | aligning clusters | s3 |
[####################] 100% 0:01:41 | concat clusters | s3 |
[####################] 100% 0:01:11 | calc cluster stats | s3 |
[####################] 100% 0:04:47 | inferring [H, E] | s4 |
[####################] 100% 0:00:44 | calculating depths | s5 |
[####################] 100% 0:01:01 | chunking clusters | s5 |
[####################] 100% 1:30:53 | consens calling | s5 |
[####################] 100% 0:02:16 | indexing alleles | s5 |
[####################] 100% 0:02:31 | concatenating inputs | s6 |
[####################] 100% 13:26:50 | clustering across | s6 |
[####################] 100% 0:01:13 | building clusters | s6 |
[####################] 100% 0:24:48 | aligning clusters | s6 |

Encountered an Error.
Message: ValueError: dictionary update sequence element #0 has length 1; 2 is required
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in ~/miniconda3/lib/python3.7/site-packages/ipyrad/assemble/clustmap_across.py in align_to_array(data, samples, chunk)
   1324 # align separately
   1325 istack1 = muscle_it(proc, names, left)
-> 1326 istack2 = muscle_it(proc, names, right)
   1327
   1328 # combine in order

~/miniconda3/lib/python3.7/site-packages/ipyrad/assemble/clustmap_across.py in muscle_it(proc, names, seqs)
   1382 # reorder b/c muscle doesn't keep order
   1383 lines = "".join(align1)[1:].split("\n>")
-> 1384 dalign1 = dict([i.split("\n", 1) for i in lines])
   1385 keys = sorted(
   1386 dalign1.keys(),

ValueError: dictionary update sequence element #0 has length 1; 2 is required

isaacovercast commented 3 months ago

What version of ipyrad are you using? You can check with ipyrad -v. The traceback you posted points to line 1384, which is not the current line number for that part of the code, so I believe you are using an older version. I believe this problem was fixed in v0.9.85, so please update to the most recent version of ipyrad and try again:

conda update -c bioconda ipyrad
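
If it helps, here is a rough sketch for checking whether an installed version predates the release mentioned above. It assumes you paste in the numeric part of whatever ipyrad -v reports. version_tuple() and needs_update() are hypothetical helpers, and 0.9.85 is only the version believed to contain the fix.

```python
FIXED_IN = "0.9.85"  # assumed from the comment above, not verified here


def version_tuple(version):
    """Turn '0.9.85' or 'v0.9.85' into (0, 9, 85) for numeric comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def needs_update(installed, fixed_in=FIXED_IN):
    """True if the installed version is older than the assumed fix release."""
    return version_tuple(installed) < version_tuple(fixed_in)


print(needs_update("0.9.62"))   # True
print(needs_update("v0.9.85"))  # False
```

Comparing integer tuples rather than raw strings matters here, because as strings "0.9.9" would sort after "0.9.85".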