Closed — isaacovercast closed this issue 3 years ago
I got the .json and the clust_database.fa file from the user and I can reproduce this error.
```python
# checked for py2/3 (keeping name order straight important)
lidx = 0
for name in snames:
    if name in tmploc:
        sidx = sidxs[name]
        tmparr[sidx, start:end] = loc[lidx]
        lidx += 1
```
Here tmploc is empty.
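A minimal standalone reproduction of the symptom (hypothetical names and shapes, mirroring the excerpt above): when `tmploc` is empty, the fill loop never writes into `tmparr`, so the chunk stays all zeros.

```python
import numpy as np

# Hypothetical setup mirroring the fill loop: sample names, an index map,
# and an EMPTY tmploc for this locus.
snames = ["sampleA", "sampleB"]
sidxs = {name: i for i, name in enumerate(snames)}
tmploc = {}                                   # empty: no samples in this locus
tmparr = np.zeros((len(snames), 10), dtype=np.uint8)
loc = np.zeros((0, 10), dtype=np.uint8)       # no locus rows to copy
start, end = 0, 10

lidx = 0
for name in snames:
    if name in tmploc:                        # never true when tmploc is empty
        tmparr[sidxs[name], start:end] = loc[lidx]
        lidx += 1

print(int(tmparr.sum()))                      # 0: nothing was filled
```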
Ok, I understand what's happening now. For each chunk file it iterates through, filling one locus at a time and dumping when the accumulated length exceeds `maxsize`. When it dumps, it resets a bunch of internal state, including `tmparr = np.zeros((ntaxa, maxsize + 50000), dtype=np.uint8)`. If that dump happens on the LAST locus in the file, the for loop exits and the code still tries to apply trimming to the freshly zeroed `tmparr` as if it were a viable final chunk, which it's not. So after line 1948 (`trim = np.where(tmparr != 0)[1]`), `trim == 150000`, and a bunch of nasty stuff happens.
There should be a test after the for loop exits to determine whether any valid data remains in `tmparr`.
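A minimal sketch of the missing guard (hypothetical shapes and names): if the last dump already reset `tmparr`, every entry is zero and there is no valid data left to trim after the loop exits.

```python
import numpy as np

# Sketch of the guard: an all-zero tmparr means the final chunk was
# already dumped, so there is nothing left to trim or write.
ntaxa, maxsize = 4, 100
tmparr = np.zeros((ntaxa, maxsize), dtype=np.uint8)

cols = np.where(tmparr != 0)[1]   # column indices of non-zero sites; empty here
if cols.size:
    trim = cols.max() + 1         # normal case: trim to the last filled column
else:
    trim = 0                      # empty final chunk: skip the final dump

print(trim)                       # 0
```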
@isaacovercast I see, so the H5 databases are fine, and the .loci and stats files are fine, but it is just failing when it is filtering sites to write the phylip output, right? And this seems to be related to the chunksize it uses as a convenience to reduce memory usage. If so, I think you're right, it's just a problem with checking the last chunk size. And I don't think this should affect anything outside of this function.
Ok cool. There is an obvious duplicated code block inside the for loop and also immediately after it, so it seems the check for the last chunk size should come before the call on line 1951, in other words before this call:

```python
phymap[1:, 1] = phymap[:-1, 2]
```
What is this doing anyway, out of curiosity? Some kind of housekeeping? I went back to look at the phymap format to figure out what's going on here, but I didn't feel like parsing it.
The phymap is very important to keep right. It is used in the analysis tools to extract windows of loci by knowing where they start and stop within the H5 sequence matrix, and for reference-mapped data it also keeps track of how those positions align with the reference genome. I'll have to look into what this chunk of code is doing again when I have some time soon, but it's related to making the phymap correct, I assume.
I get it. The columns of phymap are 'chrom', phy0, phy1, pos0, pos1. While filling the seq array we only track phy1 (the end position), so at the end phy0 is a column of 0s. This wacky thing `phymap[1:, 1] = phymap[:-1, 2]` takes the end column, shifts it down by one row, and moves it into the phy0 column, so the beginning of one locus is the same as the end of the previous locus. Before:
```python
array([[   0,    0,  134,    0,    0],
       [   1,    0,  269,    0,    0],
       [   2,    0,  403,    0,    0],
       [   3,    0,  537,    0,    0],
       [   4,    0,  671,    0,    0],
       [   5,    0,  805,    0,    0],
       [   6,    0,  939,    0,    0],
       [   7,    0, 1073,    0,    0],
       [   8,    0, 1207,    0,    0],
       [   9,    0, 1341,    0,    0]], dtype=uint64)
```
After:
```python
array([[   0,    0,  134,    0,    0],
       [   1,  134,  269,    0,    0],
       [   2,  269,  403,    0,    0],
       [   3,  403,  537,    0,    0],
       [   4,  537,  671,    0,    0],
       [   5,  671,  805,    0,    0],
       [   6,  805,  939,    0,    0],
       [   7,  939, 1073,    0,    0],
       [   8, 1073, 1207,    0,    0],
       [   9, 1207, 1341,    0,    0]], dtype=uint64)
```
Ah, beautiful numpy.
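The shift can be reproduced in isolation. This is a small standalone sketch (phy1 values taken from the example above, other columns zeroed for brevity):

```python
import numpy as np

# Reproduce the phy0 back-fill: row i's start (phy0) becomes row i-1's
# end (phy1), so consecutive loci tile the sequence matrix end to end.
phymap = np.zeros((4, 5), dtype=np.uint64)
phymap[:, 0] = np.arange(4)            # 'chrom' / locus index
phymap[:, 2] = [134, 269, 403, 537]    # phy1: cumulative end positions

phymap[1:, 1] = phymap[:-1, 2]         # phy0[i] = phy1[i-1]; phy0[0] stays 0

print(phymap[:, 1].tolist())           # [0, 134, 269, 403]
```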
This is how I handled it:
```python
if start == 0 and end == 0:
    # The last chunk fell exactly on the maxsize boundary so it has
    # already been trimmed and dumped to the phy. In this case the for
    # loop on the line iterator has fallen through (no more data) so
    # there is no final chunk to trim.
    pass
else:
    # trim final chunk tmparr to size
    trim = np.where(tmparr != 0)[1]
    if trim.size:
        trim = trim.max() + 1
    else:
        trim = tmparr.shape[1]

    # fill missing with 78 (N)
    tmparr[tmparr == 0] = 78

    # dump tmparr and maplist to hdf5. Because we dropped sites that are
    # all N or - the length of phy data can be less than nbases and so
    # there can be 000 at the end of phy. This is ok, we trim it when
    # writing the phylip file, but good to be aware it's there for other things.
    phy[:, gstart:gstart + trim] = tmparr[:, :trim]
    phymap[mapstart:locidx, 0] = mapchroms
    phymap[mapstart:locidx, 2] = mapends
    if data.isref:
        phymap[mapstart:locidx, 3] = mappos0
        phymap[mapstart:locidx, 4] = mappos1

phymap[1:, 1] = phymap[:-1, 2]
```
It solves the problem on the empirical dataset I was working with. Also, a diff of the .phy from simulated data before and after this change shows identical files, which is reassuring. I'm closing this issue, but if you want to change how this is handled let me know.
From gitter: