dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

v0.9 Step 7 write_loci_and_alleles ValueError #350

Closed EmilyPhelps closed 5 years ago

EmilyPhelps commented 5 years ago

I know this has come up a few times before. I'm having a lot of trouble with ipyrad recently, although I used it for the first time back in March and it worked perfectly on my computer. I am now trying to run it on a Linux server, and I installed/updated it this week.

It all goes smoothly until step 7 then I am getting:

Step 7: Filter and write output files for 50 Samples
  [####################] 100%  filtering loci        | 0:00:10  

  Encountered an error (see details in ./ipyrad_log.txt)
Error summary is below -------------------------------
error in filter_stacks on chunk 0: ValueError(zero-size array to reduction operation maximum which has no identity)

I have tried reducing my min_samples_locus from 25 to 4 and it still isn't working, but previously I had it working at 26! I haven't bothered with the pop file, since I saw this was causing issues.

Here is my parameters file

------- ipyrad params file (v.0.7.30)-------------------------------------------
ip_MG_test                         ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
./                             ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
./Test/*.fq.gz                 ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                        ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                      ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
                               ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1,cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
5                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
3                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5, 5                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
5                             ## [21] [min_samples_locus]: Min # samples per locus for output
20, 20                         ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8                           ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
*                              ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file

Let me know if I can give any more information!!

EmilyPhelps commented 5 years ago

So it turns out I had installed a different version of ipyrad (I think 0.6.27) to try to get around the following error:

ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /newhome/ep15438/miniconda2/lib/python2.7/site-packages/pysam/libchtslib.so)

but I had forgotten about it because I've been struggling with various issues for weeks.

So I updated it and I am currently getting the above error again without it letting me write any files at all.

I then searched through here and did

conda install pysam=0.10.0 -c bioconda
conda install ipyrad -c ipyrad -f

getting the error

  File "/newhome/ep15438/miniconda2/lib/python2.7/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *

repeatedly. Then it just doesn't run anything else. So I'm back to where I was 48 hours ago: with not a single step running! Have I cocked it up further? Help!

isaacovercast commented 5 years ago

Try conda install -c ipyrad ipyrad -f.

There is some weird dependency issue with the conda package right now that is causing this 0.6 rollback. Using -f forces the most recent version.

As an alternative, you can try the new v0.9 beta branch (much faster). You'll need the Python 3 version of miniconda, since the new ipyrad is Python 3. Once you have installed Python 3 with miniconda you can do: conda install -c bioconda -c ipyrad ipyrad. I would recommend trying this, as we are actively moving folks to v0.9 and will be switching support away from the old branch.

EmilyPhelps commented 5 years ago

Thank you! I just did that, then tried to run ipyrad and got this error message:

(base)$ ipyrad -p params-ip_MG_test.txt -s 67 -r -f
  loading Assembly: ip_MG_test
  from saved path: ~/Potamonautes/ip_MG_test.json
Warning: The format of max_Ns_consens should now be a float and was set on load to 0.05
Warning: The format of max_Hs_consens should now be a float and was set on load to 0.05
Traceback (most recent call last):
  File "/newhome/ep15438/miniconda3/bin/ipyrad", line 10, in <module>
    sys.exit(main())
  File "/newhome/ep15438/miniconda3/lib/python3.7/site-packages/ipyrad/__main__.py", line 617, in main
    CLI()
  File "/newhome/ep15438/miniconda3/lib/python3.7/site-packages/ipyrad/__main__.py", line 69, in __init__
    self.get_assembly()
  File "/newhome/ep15438/miniconda3/lib/python3.7/site-packages/ipyrad/__main__.py", line 368, in get_assembly
    data.set_params(key, param)
  File "/newhome/ep15438/miniconda3/lib/python3.7/site-packages/ipyrad/core/assembly.py", line 476, in set_params
    setattr(self.params, param, newvalue)
  File "/newhome/ep15438/miniconda3/lib/python3.7/site-packages/ipyrad/core/params.py", line 551, in max_Ns_consens
    value = float(value)
ValueError: could not convert string to float: '5, 5'

My params file is the same as above.

isaacovercast commented 5 years ago

The params file format is slightly different, so it's easiest just to create a new one. You'll also have to roll back to step 3, as many of the internal file formats are quite different in order to support the massive performance boost of the new version.
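The traceback above shows why the old file fails to load: the v0.9 loader converts max_Ns_consens with a bare float() call (params.py line 551 in the traceback), so the two-value "5, 5" string from a v0.7 params file is rejected. A simplified sketch of that check:

```python
# Simplified stand-in for the v0.9 check shown in the traceback:
# the raw params value is passed straight to float().
def parse_max_ns_consens(value):
    return float(value)

print(parse_max_ns_consens("0.05"))      # new single-float format: OK
try:
    parse_max_ns_consens("5, 5")         # old v0.7 (R1, R2) format
except ValueError as err:
    print("rejected:", err)
```

Generating a fresh params file (e.g. with ipyrad -n) sidesteps the old format entirely.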

EmilyPhelps commented 5 years ago

So I have made a new params file

------- ipyrad params file (v.0.9.11)-------------------------------------------
ip_MG_test                     ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/newhome/ep15438/Potamonautes/ipyrad3 ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
./Test/*.fq.gz                     ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
                               ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
5                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
3                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
0.05                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
25                              ## [21] [min_samples_locus]: Min # samples per locus for output
0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8                              ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
*                              ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
                               ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3

Then I ran to step 3 successfully, but during step 4 I got the error message tuple index out of range.

 Step 4: Joint estimation of error rate and heterozygosity
  [####################] 100% 0:00:05 | inferring [H, E]       

  Encountered an Error.
  Message: IndexError: tuple index out of range
  Use debug flag (-d) for full code traceback.

Sorry for all the comments!

EmilyPhelps commented 5 years ago

The debug flag gave me:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<string> in <module>
~/miniconda3/lib/python3.7/site-packages/ipyrad/assemble/jointestimate.py in optim(data, sample)
    407     try:
    408         ## get array of all clusters data
--> 409         stacked = stackarray(data, sample)
    410 
    411         ## get base frequencies
~/miniconda3/lib/python3.7/site-packages/ipyrad/assemble/jointestimate.py in stackarray(data, sample)
    337     try:
    338         cutlens[0] = len(data.params.restriction_overhang[0])
--> 339         cutlens[1] = maxlen - len(data.params.restriction_overhang[1])
    340     except TypeError:
    341         pass
IndexError: tuple index out of range

isaacovercast commented 5 years ago

Well, it looks like you haven't specified anything for the restriction_overhang parameter. Since you're using pairddrad you need to put in both overhang sequences.
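The traceback points at jointestimate.py, where both cut sites are assumed to be present. A simplified stand-in for the failing lines (MAXLEN is a placeholder for the real read length computed from the data):

```python
# Simplified version of lines 337-339 in the traceback:
# indexing restriction_overhang[1] assumes two cut sites were given.
MAXLEN = 100  # placeholder for the real maxlen

def cut_lengths(restriction_overhang):
    cutlens = [None, None]
    cutlens[0] = len(restriction_overhang[0])
    cutlens[1] = MAXLEN - len(restriction_overhang[1])
    return cutlens

print(cut_lengths(("CCGG", "AATT")))  # both overhangs set: works
try:
    cut_lengths(("",))                # param left blank: IndexError
except IndexError as err:
    print("IndexError:", err)
```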

EmilyPhelps commented 5 years ago

Hello, thanks for all your help so far. Hopefully this is the last issue I have:

 Step 7: Filtering and formatting output files 
  [####################] 100% 0:00:10 | applying filters       
  [####################] 100% 0:00:14 | building arrays        

  Encountered an Error.
  Message: ValueError: not enough values to unpack (expected 2, got 1)

  Parallel connection closed.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<string> in <module>
~/miniconda3/lib/python3.7/site-packages/ipyrad/assemble/write_outputs.py in write_loci_and_alleles(data)
   1604 
   1605                     all1, all2 = splitalleles(seq)
-> 1606                     aname, spacer = name.split(" ", 1)
   1607                     achunk.append(aname + "_0 " + spacer + all1)
   1608                     achunk.append(aname + "_1 " + spacer + all2)
ValueError: not enough values to unpack (expected 2, got 1)
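For context, the line marked 1606 in the traceback unpacks each locus name by splitting on the first space, which fails for any name string containing no space at all. A minimal illustration with hypothetical names:

```python
# Simplified version of the unpack on the traceback's line 1606:
# a name with no space cannot unpack into (aname, spacer).
def split_locus_name(name):
    aname, spacer = name.split(" ", 1)
    return aname, spacer

print(split_locus_name("MG_Sample10   "))  # padded name: unpacks fine
try:
    split_locus_name("MG_Sample10")        # no space: ValueError
except ValueError as err:
    print("ValueError:", err)
```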

My parameter file currently looks like this:

------- ipyrad params file (v.0.9.11)-------------------------------------------
ip_MG_test                     ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/newhome/ep15438/Potamonautes/ipyrad3 ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
./Test/*.fq.gz                 ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                       ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
CCGG, AATT                     ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
5                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
3                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
0.05                          ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
15                             ## [21] [min_samples_locus]: Min # samples per locus for output
0.2, 0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8                           ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
*                              ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
                               ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3

Thanks again for all your help, I really appreciate it!

isaacovercast commented 5 years ago

Can I see the first few lines of your barcodes file? Or just show me 4 or 5 different representative sample names.

EmilyPhelps commented 5 years ago

my sample names are

MG_Sample10_R1_.fq.gz
MG_Sample10_R2_.fq.gz
MG_Sample13_R1_.fq.gz
MG_Sample13_R2_.fq.gz
MG_Sample18_R1_.fq.gz
MG_Sample18_R2_.fq.gz
MG_Sample19_R1_.fq.gz
MG_Sample19_R2_.fq.gz

There are 50 samples - the numbering is not consistent, as some samples were removed already.

This is what these files look like inside

[unreadable binary characters - the raw bytes of the gzip-compressed file]

When expanded, the files look like this (this is MG_Sample3R2.fq):

@1_11101_11247_1004_2
CGGGAAGAAAGTTCTGTGACGGGGTAGACGCTTAAGTTGAATGTCCTGAAAACTTACCACTGGCGTAGGTAAAAGAATTGATCG
+
AA6AAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA
@1_11101_7843_1010_2
CGGTGTCTGAGAACGAGACACCA

Sorry if this is a really basic error!

EmilyPhelps commented 5 years ago

Hiya - I was having another think about this today, and I changed [max_shared_Hs_locus] from one number to two, so from:

0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)

to:

0.5 , 0.5                           ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)

I got the error message

ValueError: could not convert string to float: '0.5, 0.5'

Could this be the issue? Is there a quick fix for this?

isaacovercast commented 5 years ago

This parameter only accepts one value in the 0.9 branch.

EmilyPhelps commented 5 years ago

Okay, thanks - back to this ValueError: not enough values to unpack (expected 2, got 1) error :/

isaacovercast commented 5 years ago

When you switched to v0.9 did you re-run step 6?

EmilyPhelps commented 5 years ago

Yep, I re-ran the whole pipeline just in case.

isaacovercast commented 5 years ago

Hm, well, there's something unusual going on. Is there any way you can Dropbox me the raw data and the barcodes file?

isaacovercast commented 5 years ago

During the filtering/chunking phase, if no loci in a chunk pass min_samples_locus filtering, then it creates an empty .loci file, which cascades down to this error in a very non-straightforward fashion. Chunks with no loci passing filtering maybe just shouldn't be written to .loci files?
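The guard described above can be sketched as follows (write_chunk is a hypothetical helper, not the actual ipyrad function):

```python
# Sketch of the guard: skip writing a tmpfile when no loci in the
# chunk passed the min_samples_locus filter, instead of emitting an
# empty .loci file that downstream steps then fail to parse.
def write_chunk(outfile, passing_loci):
    if not passing_loci:
        return False  # nothing passed: don't create an empty tmpfile
    with open(outfile, "w") as out:
        out.write("\n".join(passing_loci) + "\n")
    return True
```

Returning early keeps an all-filtered chunk from producing the zero-length .loci tmpfile behind the unpack error.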

isaacovercast commented 5 years ago

Fixed: 2aebb35

Handle chunks with no loci passing filtering. Don't write out tmpfiles for .loci/.p/.npy in the tmpdir.