Thanks for the detailed report. This is indeed specific to reference assemblies. The uint16 issue has been present all along, but numpy versions before 2.0 masked it by silently promoting the result to a wider integer type. numpy 2.0, which was released recently, changed this behavior, so the same operation now raises an error.
numpy 1.24.4 behavior:
>>> import numpy as np
>>> a = np.uint16(10)
>>> b = 10000000
>>> a+b
10000010
numpy 2.0.1 behavior:
>>> import numpy as np
>>> a = np.uint16(10)
>>> b = 10000000
>>> a+b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python integer 10000000 out of bounds for uint16
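One way to get the same result on both numpy versions is to convert both operands to a wide enough type before adding. The sketch below is only an illustration: `add_widened` is a hypothetical helper, not part of ipyrad or numpy, and uint32 is assumed to be wide enough for the values involved.

```python
import numpy as np


def add_widened(value, offset, dtype=np.uint32):
    """Add two integers after explicitly converting both to `dtype`.

    Hypothetical helper for illustration; not part of ipyrad or numpy.
    """
    # Converting each operand first means the result type no longer
    # depends on numpy's version-specific promotion rules.
    return dtype(value) + dtype(offset)


a = np.uint16(10)
b = 10000000
total = add_widened(a, b)
print(total, total.dtype)
```

Because the conversion is explicit, the sum is computed in uint32 regardless of how a given numpy release treats Python integers.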
I modified the code so that if isref == True it uses uint32; for denovo assemblies it still uses uint16.
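The dtype choice described above could look roughly like the following. This is a minimal sketch and not the actual ipyrad code: the names `edges_dtype` and `make_edges` and the (nloci, 2) array shape are assumptions made for the example.

```python
import numpy as np


def edges_dtype(isref):
    """Return the integer dtype for the edges array (hypothetical helper)."""
    # Reference positions can exceed 65535, so use 32 bits for those;
    # denovo assemblies keep the smaller 16-bit type.
    return np.uint32 if isref else np.uint16


def make_edges(nloci, isref):
    """Allocate a zero-filled array of start/end positions.

    The (nloci, 2) shape is an assumption made for this example.
    """
    return np.zeros((nloci, 2), dtype=edges_dtype(isref))


edges = make_edges(3, isref=True)
edges[0] = (65778, 65814)  # the location reported in this thread
print(edges.dtype, edges[0])
```

Keeping uint16 for denovo data avoids doubling the memory used by the array when positions are known to be small.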
I encountered an overflow error on step 7 when running with a reference genome. I fixed it on my system by changing the dtype of the edges array from uint16 to uint32.
But I thought I would post something in case anyone else runs into this error. I have no idea if this is just some weird issue with my dataset, but it seems to have happened because the chromosome positions in the reference genome are too large to be stored in an array as uint16.
Below is the full breakdown.
Here is the output with error:
loading Assembly: data_ref from saved path: /group/dpottergrp/Reed/sp_delim/analysis/ipyrad/data_ref.json
ipyrad [v.0.9.96] Interactive assembly and analysis of RAD-seq data
Parallel connection | cpu-6-62: 60 cores
Step 7: Filtering and formatting output files
The first entry in clust_database.fa has the location 65778-65814, which is already above the uint16 maximum of 65535, so the error seems to be raised as soon as that location is encountered.
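To check whether a given position fits in a particular integer width, the range test can be written out directly. The helper below is a hypothetical, pure-Python sketch (the name `smallest_unsigned_dtype` is made up for this example); it returns dtype names as strings, which numpy accepts in its `dtype=` arguments.

```python
def smallest_unsigned_dtype(max_value):
    """Return the name of the narrowest unsigned integer type that holds max_value."""
    if max_value < 0:
        raise ValueError("unsigned types cannot hold negative values")
    for bits in (8, 16, 32, 64):
        # An unsigned type with `bits` bits stores values 0 .. 2**bits - 1.
        if max_value <= 2**bits - 1:
            return f"uint{bits}"
    raise OverflowError(f"{max_value} does not fit in 64 bits")


print(smallest_unsigned_dtype(65535))  # uint16
print(smallest_unsigned_dtype(65814))  # uint32
```

The positions in the first entry fall just past the 16-bit limit, which is why the error appears on that entry.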
As I said above, the solution is fairly simple: the edges array has to be able to store values larger than uint16 allows (it maxes out at 65535). You just have to change this line of code in
to this
I have not yet moved on to any downstream analysis, so hopefully this has no unforeseen effects. Otherwise everything was super smooth, thanks for writing such a great pipeline!