dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

ipyrad (reference version, pairddrad) step 7 crash -- Message: KeyError: 68 #508

Open imaa9 opened 1 year ago

imaa9 commented 1 year ago

Hi Isaac,

I'm running ipyrad [v.0.9.90] with maximum memory allocation (184G), 48 threads, and with the following (relevant) params:

~/all_trimmed_reads/*.fq      ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted/trimmed/unzipped fastq files
reference                     ## [5] [assembly_method]: Assembly method
~/reference1.1.fa             ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                     ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
AATTC, GCATG                  ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2) [EcoRI, SphI]
5                             ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                            ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
5                             ## [11] [mindepth_statistical]: Min depth for statistical base calling
5                             ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                         ## [13] [maxdepth]: Max cluster depth within samples [default = 10,000]
0.86                          ## [14] [clust_threshold]: Clustering threshold for de novo assembly
2                             ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.1                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.1                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
5                             ## [21] [min_samples_locus]: GLOBAL Min # samples per locus
0.25                          ## [22] [max_SNPs_locus]: Max % SNPs per locus
8                             ## [23] [max_Indels_locus]: Max # of indels per locus
0.5                           ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
*                             ## [27] [output_formats]: Output formats (see docs) [* = all of them]

This runs fine through part 2 of step 7:

  Step 7: Filtering and formatting output files
  [####################] 100% 0:05:58 | applying filters
  [####################] 100% 1:02:59 | building arrays

  Encountered an Error.
  Message: KeyError: 68
  Parallel connection closed.

Here is the traceback info:

KeyError                                  Traceback (most recent call last)
File <string>:1, in <module>

File ~/.conda/envs/ipyrad/lib/python3.10/site-packages/ipyrad/assemble/write_outputs.py:2158, in fill_snp_array(data, ntaxa, nsnps)
   2156 # fill for each taxon
   2157 for sidx in range(ntaxa):
-> 2158     resos = [DCONS[i] for i in snparr[sidx, :]]
   2160     # pseudoref version
   2161     io5['genos'][:, sidx, :] = get_genos(
   2162         np.array([i[0] for i in resos]),
   2163         np.array([i[1] for i in resos]),
   2164         io5['pseudoref'][:]
   2165     )

File ~/.conda/envs/ipyrad/lib/python3.10/site-packages/ipyrad/assemble/write_outputs.py:2158, in <listcomp>(.0)
   2156 # fill for each taxon
   2157 for sidx in range(ntaxa):
-> 2158     resos = [DCONS[i] for i in snparr[sidx, :]]
   2160     # pseudoref version
   2161     io5['genos'][:, sidx, :] = get_genos(
   2162         np.array([i[0] for i in resos]),
   2163         np.array([i[1] for i in resos]),
   2164         io5['pseudoref'][:]
   2165     )

KeyError: 68
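A note on the key itself: the traceback shows DCONS being indexed with values taken from snparr, which suggests (an assumption, since the dictionary is not shown here) that the keys are byte values of base characters. Read that way, 68 is the ASCII code for `D`, the IUPAC symbol for "A, G, or T". A minimal sketch for decoding such a key follows; the `IUPAC` table is the standard code list, and `describe_key` is an illustrative helper, not part of ipyrad.

```python
# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def describe_key(code: int) -> str:
    """Translate an integer key (as in 'KeyError: 68') into a base symbol.

    Hypothetical helper for reading the error message; it assumes the key
    is the ASCII/byte value of a nucleotide character.
    """
    symbol = chr(code)
    bases = IUPAC.get(symbol.upper())
    if bases is None:
        return f"{code} -> {symbol!r} (not an IUPAC nucleotide code)"
    return f"{code} -> {symbol!r} (IUPAC: {' or '.join(bases)})"


print(describe_key(68))  # 68 -> 'D' (IUPAC: A or G or T)
```

If that reading is right, a three-base ambiguity code reached the SNP array, which is consistent with ambiguous bases in the reference being the trigger.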

I followed the suggestion from a previous issue about using a reference genome with the ambiguous bases masked (I just converted each one to one of its possible resolutions) and tried running step 7 again with that reference, but it failed as above. Do I need to run the entire pipeline again from the beginning using the disambiguated reference, or is something else causing this error in step 7? Any insights would be much appreciated!

Thanks, Inbar
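Before resubmitting a large job, it may be worth confirming that the edited reference really contains no remaining ambiguity codes. The following is a minimal sketch, assuming an uncompressed plain-text FASTA file; the function names are illustrative and not part of ipyrad. Note that `mask_ambiguous` replaces each code with `N`, which is a different choice from picking one resolution per site as described in the post; the thread does not settle which is preferable.

```python
from collections import Counter

# Characters accepted as unambiguous (N is treated as an explicit "unknown").
ALLOWED = set("ACGTN")


def count_ambiguous(fasta_path):
    """Count sequence characters other than A, C, G, T, N in a FASTA file.

    Header lines (starting with '>') are skipped; case is ignored.
    """
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue
            counts.update(ch for ch in line.strip().upper() if ch not in ALLOWED)
    return counts


def mask_ambiguous(fasta_in, fasta_out):
    """Write a copy of the FASTA with every non-ACGTN sequence character set to N.

    Headers are copied unchanged; the case of unambiguous bases is preserved.
    """
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(line)
                continue
            seq = line.rstrip("\n")
            dst.write("".join(ch if ch.upper() in ALLOWED else "N" for ch in seq) + "\n")
```

Calling `count_ambiguous` on the reference path and getting an empty `Counter` back would indicate that no ambiguity codes remain in the file itself. It says nothing about intermediate files already built from the old reference.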

isaacovercast commented 1 year ago

Yes, ambiguous bases in the reference will cause problems, so it's good you found that and fixed it. By step 7 all of the formal assembly has already been completed, so for the corrected reference to fix this error you will need to roll back and re-run from at least step 3 (including the -f flag). Let me know how it goes.

imaa9 commented 1 year ago

Cool, many thanks for the quick reply! I'll run it again from the start; I think that should fix it. I just wanted to make sure this was the issue before I submit this big job again.