Thanks for the detailed report. This is in fact related to reference assemblies. The uint16 issue has actually been a problem the whole time, but it was masked in numpy < 2.0, which silently promoted the result to a larger integer type. numpy recently rolled out 2.0, which changed this behavior.
numpy 1.24.4 behavior:

```python
>>> import numpy as np
>>> a = np.uint16(10)
>>> b = 10000000
>>> a+b
10000010
```
numpy 2.0.1 behavior:

```python
>>> import numpy as np
>>> a = np.uint16(10)
>>> b = 10000000
>>> a+b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python integer 10000000 out of bounds for uint16
```
I modified write_outputs.py: if isref == True it now uses uint32, and if denovo it uses uint16.
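The shape of that change is roughly the following; a minimal sketch, not the exact ipyrad code (the `make_edges` name and bare `isref` argument here are illustrative):

```python
import numpy as np

def make_edges(chunksize, isref):
    # Reference-mapped coordinates are genome positions and can exceed
    # 65535, so they need uint32; denovo edge offsets fit in uint16.
    dtype = np.uint32 if isref else np.uint16
    return np.zeros((chunksize, 4), dtype=dtype)
```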
I encountered an overflow error on step 7 when running with a reference genome. I fixed it on my system by changing the dtype of the edges array in write_outputs.py (the allocation commented `(R1>, <R1, R2>, <R2)`) from uint16 to uint32.
But I thought I would post something in case anyone else runs into this error. I have no idea if this is just some weird issue with my dataset, but it seems to have happened because the chromosome positions in the reference genome are too large to be stored in an array as uint16.
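For anyone who wants to confirm the diagnosis, the failure reproduces in isolation under numpy >= 2.0 (the `edges` array below just stands in for the one ipyrad allocates):

```python
import numpy as np

edges = np.zeros((1, 4), dtype=np.uint16)
edges[0, 0] = 65778  # any position past the uint16 max of 65535
# OverflowError: Python integer 65778 out of bounds for uint16
```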
Below is the full breakdown.
Here is the output with the error:

```
loading Assembly: data_ref from saved path: /group/dpottergrp/Reed/sp_delim/analysis/ipyrad/data_ref.json
ipyrad [v.0.9.96]
Interactive assembly and analysis of RAD-seq data
Parallel connection | cpu-6-62: 60 cores

Step 7: Filtering and formatting output files
[                    ]   0% 0:00:02 | applying filters
[                    ]   2% 0:00:09 | applying filters
[##                  ]  14% 0:00:10 | applying filters
[#####               ]  29% 0:00:10 | applying filters
[########            ]  41% 0:00:11 | applying filters
[###########         ]  58% 0:00:11 | applying filters
[#################   ]  85% 0:00:12 | applying filters
[####################] 100% 0:00:12 | applying filters
Encountered an Error.
Message: OverflowError: Python integer 65778 out of bounds for uint16
Parallel connection closed.

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
File <...>:1

File ~/mambaforge/envs/ipyrad/lib/python3.12/site-packages/ipyrad/assemble/write_outputs.py:608, in process_chunk(data, chunksize, chunkfile)
    605 def process_chunk(data, chunksize, chunkfile):
    606     # process chunk writes to files and returns proc with features.
    607     proc = Processor(data, chunksize, chunkfile)
--> 608     proc.run()
    610     # check for variants or set max to 0
    611     try:

File ~/mambaforge/envs/ipyrad/lib/python3.12/site-packages/ipyrad/assemble/write_outputs.py:852, in Processor.run(self)
    849         self.pis[snparr[:, 1].sum()] += 1
    851         # write to .loci string
--> 852         locus = self.to_locus(ablock, snparr, edg)
    853         self.outlist.append(locus)
    855     # If no loci survive filtering then don't write the files

File ~/mambaforge/envs/ipyrad/lib/python3.12/site-packages/ipyrad/assemble/write_outputs.py:889, in Processor.to_locus(self, block, snparr, edg)
    887 chrom, pos = refpos.split(":")
    888 ostart, end = pos.split("-")
--> 889 start = int(ostart) + edg[0]
    890 end = start + (edg[3] - edg[0])
    892 # get consens hit indexes and start positions

OverflowError: Python integer 65778 out of bounds for uint16
```
The first entry in clust_database.fa has the location 65778-65814, and 65778 is just past the uint16 maximum of 65535, so the error is raised as soon as that position is encountered.
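That is consistent with the failing line in the traceback, `start = int(ostart) + edg[0]`: adding a Python int to a uint16 element makes numpy >= 2.0 try to fit the int into uint16. A minimal demonstration:

```python
import numpy as np

edg = np.zeros(4, dtype=np.uint16)
start = 65778 + edg[0]
# numpy >= 2.0: OverflowError: Python integer 65778 out of bounds for uint16
```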
Like I said above, the solution is fairly simple. You just have to allow the edges array to store values larger than the uint16 maximum of 65535, which means changing the dtype of the edges allocation in write_outputs.py (the line commented `(R1>, <R1, R2>, <R2)`) from uint16 to uint32, as sketched below.
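The actual code line did not survive the formatting above, so here is a sketch of the before/after with illustrative names (`chunksize`, `edges`); check your installed write_outputs.py for the real allocation:

```python
import numpy as np

chunksize = 1000  # illustrative

# before: edge/position values capped at the uint16 max of 65535
edges = np.zeros((chunksize, 4), dtype=np.uint16)

# after: room for genome coordinates up to 2**32 - 1
edges = np.zeros((chunksize, 4), dtype=np.uint32)
```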
I have not yet moved on to any downstream analysis, so hopefully this doesn't have any unforeseen effects. Otherwise everything was super smooth. Thanks for writing such a great pipeline!