dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0
70 stars 39 forks source link

ipyrad reference assembly: Key error 7372 #556

Closed frsepulveda closed 2 months ago

frsepulveda commented 2 months ago

Hi! I was running ipyrad v.0.9.92 on ubuntu 22.04 with a reference SNP calling and a error was found. When I run the step 7, I received "Key error 7372". I already use this data set to do a reference SNP calling and don't have any problems. Now, I was doing it again just to check some results and encounter this error.

This is some relevant information:

Params:

/media/server/a77f75fe-fd07-402e-84d7-a7341c29141c/Pancho/radseq/original_data/demu/Demultiplex/*.fq.gz ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files reference ## [5] [assembly_method]: Assembly method (denovo, reference) /media/server/a77f75fe-fd07-402e-84d7-a7341c29141c/Pancho/alga_genomas/Mazzaella/POS_Hembra/POS_Hembra.fasta ## [6] [reference_sequence]: Location of reference sequence file pairddrad ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc. TGCAG ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2) 5 ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read 33 ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard) 6 ## [11] [mindepth_statistical]: Min depth for statistical base calling 6 ## [12] [mindepth_majrule]: Min depth for majority-rule base calling 10000 ## [13] [maxdepth]: Max cluster depth within samples 0.85 ## [14] [clust_threshold]: Clustering threshold for de novo assembly 0 ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes 2 ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter) 35 ## [17] [filter_min_trim_len]: Min length of reads after adapter trim 2 ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences 0.05 ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus 0.05 ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus 4 ## [21] [min_samples_locus]: Min # samples per locus for output 0.2 ## [22] [max_SNPs_locus]: Max # SNPs per locus 8 ## [23] [max_Indels_locus]: Max # of indels per locus 0.5 ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus 0, 0, 0, 0 ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs) 0, 0, 0, 0 ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2) v ## [27] [output_formats]: Output formats (see docs)

Debug:

KeyError Traceback (most recent call last) File :1

File ~/miniconda3/envs/ipyrad/lib/python3.10/site-packages/ipyrad/assemble/write_outputs.py:608, in process_chunk(data, chunksize, chunkfile) 605 def process_chunk(data, chunksize, chunkfile): 606 # process chunk writes to files and returns proc with features. 607 proc = Processor(data, chunksize, chunkfile) --> 608 proc.run() 610 # check for variants or set max to 0 611 try:

File ~/miniconda3/envs/ipyrad/lib/python3.10/site-packages/ipyrad/assemble/write_outputs.py:848, in Processor.run(self) 846 else: 847 self.nbases += self.masked.shape[1] --> 848 self.var[snparr[:, :].sum()] += 1 849 self.pis[snparr[:, 1].sum()] += 1
851 # write to .loci string

KeyError: 7372

imagen

isaacovercast commented 2 months ago

You say that you ran it already and it worked, but did you use the same version of ipyrad? Did you change any parameters for this run?

This error is cropping up because it has found a locus with 7372 variable sites. This seems like a very extreme value (the maximum possible value we allow for is 5000), which is probably caused by a HUGE contig-like region of overlapping reads. You can try to filter this out by reducing the max_SNPs_locus parameter.

If you want to retain this locus you can edit ipyrad/assemble/write_outputs.py and change this line (L703):

        self.var = {i: 0 for i in range(5000)}

To increase 5000 to something bigger than 7372. It is probably a really weird locus though so I wouldn't count on it being useful or making sense.

You might take a look at the _across/_clust_database.fa file to check how the loci are looking before step 7. I'll bet there are some of them that are just massive and probably very very sparse.

Is this RADSeq data? Maybe the genome size is really small?

frsepulveda commented 2 months ago

Hi Isaac! Thanks for the reply.

You say that you ran it already and it worked, but did you use the same version of ipyrad? Did you change any parameters for this run? R: Yes, i just the same data set and the same params file as always, and this error appears.

Is this RADSeq data? Maybe the genome size is really small? R: Is small, about 100Mb from a red algae, so is highly repetitive genome (about 50%).

Reduce the "max_SNPs_locus" 0.2 to 0.05 works fine!

Thanks!

isaacovercast commented 2 months ago

Wonderful! Thanks for letting me know! Glad it worked.