Thanks for the detailed report. This is in fact related to reference assemblies. The uint16 issue has actually been a problem the whole time, but it was masked in numpy < 2.0, which silently promoted the result to a larger integer type. numpy recently rolled out 2.0, which changed this behavior.
numpy 1.24.4 behavior:

```python
>>> import numpy as np
>>> a = np.uint16(10)
>>> b = 10000000
>>> a+b
10000010
```
numpy 2.0.1 behavior:

```python
>>> import numpy as np
>>> a = np.uint16(10)
>>> b = 10000000
>>> a+b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python integer 10000000 out of bounds for uint16
```
I modified write_outputs.py: if isref == True it now uses uint32, and if denovo it uses uint16.
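The shape of that change is roughly the following; a minimal sketch, not the exact ipyrad code (the `make_edges` name and bare `isref` argument here are illustrative):

```python
import numpy as np

def make_edges(chunksize, isref):
    # Reference-mapped coordinates are genome positions and can exceed
    # 65535, so they need uint32; denovo edge offsets fit in uint16.
    dtype = np.uint32 if isref else np.uint16
    return np.zeros((chunksize, 4), dtype=dtype)
```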
I encountered an overflow error on step 7 when running with a reference genome. I fixed it on my system by changing the dtype of the edges array in write_outputs.py (the allocation commented `(R1>, <R1, R2>, <R2)`) from uint16 to uint32.
But I thought I would post something in case anyone else runs into this error. I have no idea if this is just some weird issue with my dataset, but it seems to have happened because the chromosome positions in the reference genome are too large to be stored in an array as uint16.
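For anyone who wants to confirm the diagnosis, the failure reproduces in isolation under numpy >= 2.0 (the `edges` array below just stands in for the one ipyrad allocates):

```python
import numpy as np

edges = np.zeros((1, 4), dtype=np.uint16)
edges[0, 0] = 65778  # any position past the uint16 max of 65535
# OverflowError: Python integer 65778 out of bounds for uint16
```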
Below is the full breakdown.
Here is the output with the error:

```
loading Assembly: data_ref from saved path: /group/dpottergrp/Reed/sp_delim/analysis/ipyrad/data_ref.json
ipyrad [v.0.9.96]
Interactive assembly and analysis of RAD-seq data
Parallel connection | cpu-6-62: 60 cores

Step 7: Filtering and formatting output files
[                    ]   0% 0:00:02 | applying filters
[                    ]   2% 0:00:09 | applying filters
[##                  ]  14% 0:00:10 | applying filters
[#####               ]  29% 0:00:10 | applying filters
[########            ]  41% 0:00:11 | applying filters
[###########         ]  58% 0:00:11 | applying filters
[#################   ]  85% 0:00:12 | applying filters
[####################] 100% 0:00:12 | applying filters
Encountered an Error.
Message: OverflowError: Python integer 65778 out of bounds for uint16
Parallel connection closed.

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
File <...>:1

File ~/mambaforge/envs/ipyrad/lib/python3.12/site-packages/ipyrad/assemble/write_outputs.py:608, in process_chunk(data, chunksize, chunkfile)
    605 def process_chunk(data, chunksize, chunkfile):
    606     # process chunk writes to files and returns proc with features.
    607     proc = Processor(data, chunksize, chunkfile)
--> 608     proc.run()
    610     # check for variants or set max to 0
    611     try:

File ~/mambaforge/envs/ipyrad/lib/python3.12/site-packages/ipyrad/assemble/write_outputs.py:852, in Processor.run(self)
    849         self.pis[snparr[:, 1].sum()] += 1
    851         # write to .loci string
--> 852         locus = self.to_locus(ablock, snparr, edg)
    853         self.outlist.append(locus)
    855     # If no loci survive filtering then don't write the files

File ~/mambaforge/envs/ipyrad/lib/python3.12/site-packages/ipyrad/assemble/write_outputs.py:889, in Processor.to_locus(self, block, snparr, edg)
    887 chrom, pos = refpos.split(":")
    888 ostart, end = pos.split("-")
--> 889 start = int(ostart) + edg[0]
    890 end = start + (edg[3] - edg[0])
    892 # get consens hit indexes and start positions

OverflowError: Python integer 65778 out of bounds for uint16
```
The first entry in clust_database.fa has the location 65778-65814, and 65778 is just past the uint16 maximum of 65535, so the error is raised as soon as that position is encountered.
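That is consistent with the failing line in the traceback, `start = int(ostart) + edg[0]`: adding a Python int to a uint16 element makes numpy >= 2.0 try to fit the int into uint16. A minimal demonstration:

```python
import numpy as np

edg = np.zeros(4, dtype=np.uint16)
start = 65778 + edg[0]
# numpy >= 2.0: OverflowError: Python integer 65778 out of bounds for uint16
```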
Like I said above, the solution is fairly simple. You just have to allow the edges array to store values larger than the uint16 maximum of 65535, which means changing the dtype of the edges allocation in write_outputs.py (the line commented `(R1>, <R1, R2>, <R2)`) from uint16 to uint32, as sketched below.
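The actual code line did not survive the formatting above, so here is a sketch of the before/after with illustrative names (`chunksize`, `edges`); check your installed write_outputs.py for the real allocation:

```python
import numpy as np

chunksize = 1000  # illustrative

# before: edge/position values capped at the uint16 max of 65535
edges = np.zeros((chunksize, 4), dtype=np.uint16)

# after: room for genome coordinates up to 2**32 - 1
edges = np.zeros((chunksize, 4), dtype=np.uint32)
```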
I have not yet moved on to any downstream analysis, so hopefully this doesn't have any unforeseen effects. Otherwise everything was super smooth. Thanks for writing such a great pipeline!