dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0
70 stars 39 forks source link

Step 7: TypeError: Can't broadcast (47, 150000) -> (47, 4909) #422

Closed isaacovercast closed 3 years ago

isaacovercast commented 3 years ago

From gitter:

I am getting the following error on step 7, building arrays, de novo, v. 0.9.59: Message: TypeError: Can't broadcast (47, 150000) -> (47, 4909) I have searched for similar errors, and most have come on earlier steps, dealing with branching errors or long reads slipping through filters. I am following the pipeline of McCarntey-Melstad et al. 2019 to determine the optimal clustering threshold by graphing various output stats for different thresholds. I have successfully assembled 5 other clustering thresholds, but for the largest (0.97) I am getting this error. All other parameter values are identical as previous assemblies. When I tried running this same clustering threshold with a higher "max_samples_locus" (increased from 4 to 40), the output files write with no error. Running with the -d flag produces the following report:

---------------------------------------------------------------------------TypeError                                 Traceback (most recent call last)<string> in <module>
~/miniconda3/envs/iPyrad/lib/python3.7/site-packages/ipyrad/assemble/write_outputs.py in fill_seq_array(data, ntaxa, nbases, nloci)
   1943         # there can be 000 at the end of phy. This is ok, we trim it when
   1944         # writing phylip file, but good to be aware it's there for other things
-> 1945         phy[:, gstart:gstart + trim] = tmparr[:, :trim]
   1946         phymap[mapstart:locidx, 0] = mapchroms
   1947         phymap[mapstart:locidx, 2] = mapends
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
~/miniconda3/envs/iPyrad/lib/python3.7/site-packages/h5py/_hl/dataset.py in __setitem__(self, args, val)
    705             mshape_pad = mshape
    706         mspace = h5s.create_simple(mshape_pad, (h5s.UNLIMITED,)*len(mshape_pad))
--> 707         for fspace in selection.broadcast(mshape):
    708             self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
    709 
~/miniconda3/envs/iPyrad/lib/python3.7/site-packages/h5py/_hl/selections.py in broadcast(self, target_shape)
    297                     tshape.append(t)
    298                 else:
--> 299                     raise TypeError("Can't broadcast %s -> %s" % (target_shape, self.mshape))
    300 
    301         if any([n > 1 for n in target]):
TypeError: Can't broadcast (47, 150000) -> (47, 4909)
Reports that it works for minsamples 40 but not 4. And that it works with minsamples 4 with clustthreshold 0.96.
isaacovercast commented 3 years ago

I got the .json and the clust_database.fa file from the user and I can reproduce this error.

isaacovercast commented 3 years ago
                    # checked for py2/3 (keeping name order straight important)
                    lidx = 0
                    for name in snames:
                        if name in tmploc:
                            sidx = sidxs[name]
                            tmparr[sidx, start:end] = loc[lidx]
                            lidx += 1

Here tmploc is empty.

isaacovercast commented 3 years ago

Ok, i understand what's happening now. For each chunk file it iterates through filling one locus at a time and dumping when the accumulated length is greater than 'maxsize'. Now, when it dumps, it resets a bunch of internal state, including tmparr = np.zeros((ntaxa, maxsize + 50000), dtype=np.uint8), and if this is the LAST locus in the file then it exits the for loop and tries to apply trimming to this as if it were a viable final chunk, which it's not, so after line 1948 trim = np.where(tmparr != 0)[1] trim == 150000, and a bunch of nasty stuff happens.

isaacovercast commented 3 years ago

There should be a test somewhere after the for loop exits to determine whether there is any valid data remaining in the tmparr.

eaton-lab commented 3 years ago

@isaacovercast I see, so the H5 databases are fine, the .loci and stats files fine, but it is just failing when it is filtering sites to write the phylip output, right? And this seems to be related to the chunksize it uses as a convenience to reduce memory usage. If so, I think you're right its just a problem with checking the last chunk size. And I don't think this should affect anything outside of this function.

isaacovercast commented 3 years ago

Ok cool. There is an obvious duplicate code block inside the for loop and also immediately after, so it seems the check for last chunk size should come before the call on line 1951, in other words before this call:

phymap[1:, 1] = phymap[:-1, 2]

What is this doing anyway, out of curiosity? Some kind of housekeeping? I went back to look at the phymap format to figure out whats going on here but i didn't feel like parsing it.

On Wed, Oct 14, 2020 at 3:23 PM Eaton Lab notifications@github.com wrote:

@isaacovercast https://github.com/isaacovercast I see, so the H5 databases are fine, the .loci and stats files fine, but it is just failing when it is filtering sites to write the phylip output, right? And this seems to be related to the chunksize it uses as a convenience to reduce memory usage. If so, I think you're right its just a problem with checking the last chunk size. And I don't think this should affect anything outside of this function.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dereneaton/ipyrad/issues/422#issuecomment-708398686, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNSXP3HEK4J7UNJN2MRAJ3SKWQ4PANCNFSM4SPR2F2Q .

eaton-lab commented 3 years ago

The phymap is very important to keep right. This is used in the analysis tools to extract windows of loci by knowing where they start and stop within the H5 sequence matrix, and for ref mapped data it also keeps track of how those positions align with the ref genome. I'll have to look into what this chunk of code is doing again when I have some time soon, but its related to making the phymap correct I assume.

isaacovercast commented 3 years ago

I get it. The columns of phymap are 'chrom', phy0, phy1, pos0, pos1. While filling the seq array we only track phy1 (the end position), so at the end phy0 is a column of 0s. This wacky thing `phymap[1:, 1] = phymap[:-1, 2] takes the end column shifts it up by one and moves it to the phy0 column, essentially the beginning of one locus is the same as the end of the previous locus. Before:

array([[   0,    0,  134,    0,    0],
       [   1,    0,  269,    0,    0],
       [   2,    0,  403,    0,    0],
       [   3,    0,  537,    0,    0],
       [   4,    0,  671,    0,    0],
       [   5,    0,  805,    0,    0],
       [   6,    0,  939,    0,    0],
       [   7,    0, 1073,    0,    0],
       [   8,    0, 1207,    0,    0],
       [   9,    0, 1341,    0,    0]], dtype=uint64)

After:

array([[   0,    0,  134,    0,    0],
       [   1,  134,  269,    0,    0],
       [   2,  269,  403,    0,    0],
       [   3,  403,  537,    0,    0],
       [   4,  537,  671,    0,    0],
       [   5,  671,  805,    0,    0],
       [   6,  805,  939,    0,    0],
       [   7,  939, 1073,    0,    0],
       [   8, 1073, 1207,    0,    0],
       [   9, 1207, 1341,    0,    0]], dtype=uint64)
eaton-lab commented 3 years ago

Ah, beautiful numpy.

isaacovercast commented 3 years ago

This is how I handled it:

        if start == 0 and end == 0:
            # The last chunk fell exactly on the maxsize boundary so it has
            # already been trimmed and dumped to the phy. In this case the for
            # loop on the line iterator has fallen through (no more data) so
            # there is no final chunk to trim.
            pass
        else:
            # trim final chunk tmparr to size
            trim = np.where(tmparr != 0)[1]
            if trim.size:
                trim = trim.max() + 1
            else:
                trim = tmparr.shape[1]

            # fill missing with 78 (N)
            tmparr[tmparr == 0] = 78

            # dump tmparr and maplist to hdf5. Because we dropped sites that are 
            # all N or - the length of phy data can be less than nbases and so 
            # there can be 000 at the end of phy. This is ok, we trim it when
            # writing phylip file, but good to be aware it's there for other things
            phy[:, gstart:gstart + trim] = tmparr[:, :trim]
            phymap[mapstart:locidx, 0] = mapchroms
            phymap[mapstart:locidx, 2] = mapends
            if data.isref:
                phymap[mapstart:locidx, 3] = mappos0
                phymap[mapstart:locidx, 4] = mappos1
        phymap[1:, 1] = phymap[:-1, 2]

It solves the problem on the empirical dataset I was working with. Also, a diff of the .phy from simulated data before and after this change show identical files, so that's reassuring. I'm closing this issue, but if you want to change how this is handled let me know.