dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0
70 stars 39 forks source link

Step 7 - No loci pass filtering (filtered by rm_duplicates) #429

Closed isaacovercast closed 3 years ago

isaacovercast commented 3 years ago

This was a monster to debug, which included an incredible stroke of luck. Someone on gitter reported an issue where they were getting all loci filtered by rm_duplicates. I got some of their raw data, and installed ipyrad in a new conda env on an old mac that I used to develop on a million years ago, but haven't touched in a long time. I was able to reproduce this and tracked it down to Processor.next_locus() lines 742-743:

        # filter to include only samples in this assembly
        mask = [i in self.data.snames for i in self.names]
        self.names = np.array(self.names)[mask].tolist()

In my testing notebook this code was giving an error /Users/iovercast/Documents/tmp/Hinojosa/ipyrad/ipyrad/assemble/write_outputs.py:1: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index and self.names was returning a list of identical sample names. It turns out that older versions of numpy can give weird behavior if you use a boolean index that is a raw list and not an ndarray e.g. stackoverflow. It treats the boolean list as an index selection, rather than a boolean mask, so it returns a list of copies of names at the 1st index (True == 1). FML!

Mac box, old numpy (doesn't work):

>>> import numpy as np
>>> np.__version__
'1.11.3'
>>> wat = np.array(["wat", "do"])
>>> mask = [True, True]
>>> wat[mask]
__main__:1: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
array(['do', 'do'],
      dtype='<U3')

Linux box, current numpy (works):

>>> import numpy as np
>>> np.__version__
'1.15.4'
>>> wat = np.array(["wat", "do"])
>>> mask = [True, True]
>>> wat[mask]
array(['wat', 'do'], dtype='|S3')

Here's the fix, which will not alter behavior on current versions of numpy and should have negligible performance costs:

        # filter to include only samples in this assembly
        mask = np.array([i in self.data.snames for i in self.names])
        self.names = np.array(self.names)[mask].tolist()

I think the issue is that if you only use bioconda, something about dependencies in the stock conda channel forces you to get an old version of numpy. This should ONLY effect users on a mac who install ipyrad without including conda-forge as a channel.

isaacovercast commented 3 years ago

The other alternative is pegging the numpy version in the conda recipe.

isaacovercast commented 3 years ago

Verified results of this change are identical on pedicularis data.

isaacovercast commented 3 years ago

The stroke of luck is that I LITERALLY have not used this mac for any analysis in like 2+ years, and for whatever reason I just decided to do it there. I would not have discovered this bug on a current linux box, and tracking this down to an old version of numpy would have driven me totally nuts, might have been impossible.

eaton-lab commented 3 years ago

That's insane! I agree pinning the np version is a good idea in case this is lurking anywhere else.