cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

`out` kwarg to `numpy.sum` "unexpected" #66

Closed: travc closed this issue 8 years ago

travc commented 8 years ago

numpy.sum (I'm using version 1.10.2) is throwing a TypeError exception on at least the object returned by GenotypeChunkedArray.to_n_ref(). It looks to me like the bug is that the out kwarg isn't accepted down at the bcolz level... way too deep for me to try to fix properly.

I just updated scikit-allel (0.20.3) today from a fresh git clone, and numpy is 1.10.2 as mentioned. Anyway, I'll hack in a workaround since I've got some results to deliver tomorrow, but a more proper fix is beyond my abilities (at least in any sort of timely manner).

Here is a traceback (g is an allel.GenotypeChunkedArray that has been subset and filtered with compress):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      2 # filter by number of *samples* which have a low-freq allele (either hom or het)
      3 nref = g.to_n_ref(fill=-1)
----> 4 nsamp_alt = np.sum(nref > 0,axis=1)
      5 nsamp_ref = np.sum(np.logical_and(nref >= 0, nref < 2), axis=1)
      6 flt2 = np.logical_or(nsamp_ref == TARGET_OBSERVATION_COUNT, nsamp_alt == TARGET_OBSERVATION_COUNT)

/usr/local/lib/python3.4/dist-packages/numpy/core/fromnumeric.py in sum(a, axis, dtype, out, keepdims)
   1826                                  out=out, keepdims=keepdims)
   1827         # NOTE: Dropping the keepdims parameters here...
-> 1828         return sum(axis=axis, dtype=dtype, out=out)
   1829     else:
   1830         return _methods._sum(a, axis=axis, dtype=dtype,

/usr/local/lib/python3.4/dist-packages/allel/chunked/core.py in asum(data, axis, mapper, blen, storage, create, **kwargs)
    201     return reduce_axis(data, axis=axis, reducer=np.sum,
    202                        block_reducer=np.add, mapper=mapper,
--> 203                        blen=blen, storage=storage, create=create, **kwargs)
    204 
    205 

/usr/local/lib/python3.4/dist-packages/allel/chunked/core.py in reduce_axis(data, reducer, block_reducer, mapper, axis, blen, storage, create, **kwargs)
    173             r = reducer(block, axis=axis)
    174             if out is None:
--> 175                 out = getattr(storage, create)(r, expectedlen=length, **kwargs)
    176             else:
    177                 out.append(r)

/usr/local/lib/python3.4/dist-packages/allel/chunked/storage_bcolz.py in array(self, data, expectedlen, **kwargs)
     36         data = _util.ensure_array_like(data)
     37         kwargs = self._set_defaults(kwargs)
---> 38         return bcolz.carray(data, expectedlen=expectedlen, **kwargs)
     39 
     40     def table(self, data, names=None, expectedlen=None, **kwargs):

bcolz/carray_ext.pyx in bcolz.carray_ext.carray.__cinit__ (bcolz/carray_ext.c:13467)()

TypeError: __cinit__() got an unexpected keyword argument 'out'
alimanfoo commented 8 years ago

Thanks Travis, I'll take a look ASAP. It looks like the offending statement is nref > 0, but I will need to investigate.

If the data are not too big then one workaround may be to convert the nref chunked array to numpy, i.e., call nref[:] > 0.
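
For example, a minimal sketch of that workaround (assuming nref is the chunked array from the traceback above):

import numpy as np

t = nref[:] > 0                # [:] loads the chunked array into a plain numpy array first
nsamp_alt = np.sum(t, axis=1)  # ordinary numpy reduction; the failing out= path is never hit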


alimanfoo commented 8 years ago

Hi Travis, I've got to the bottom of this. There is a PR (#67) which includes a workaround, so this error will not be raised in future.

However, I should mention that in general, passing a chunked array into any of the normal numpy functions like np.sum() as if it were a numpy array may not always work as expected. In the best case it will automatically load the chunked array into a plain numpy array, which may consume lots of memory if the array is large. In the worst case (like this one) behaviour may be unexpected. I have included a workaround for this case, but there may be other similar oddities in numpy.

In this case nref is a chunked array, and nref > 0 also returns a chunked array. There are a couple of options that avoid calling np.sum() on a chunked array, and make explicit when you are or are not converting chunked arrays into numpy arrays.

The first option is to convert nref > 0 into a numpy array before passing it into np.sum(). E.g.:

t = (nref > 0)[:]  # this step converts to numpy array
nsamp_alt = np.sum(t, axis=1)

A second option, which is better if your genotype array is large, is to use the chunked sum() function originating from the allel.chunked.core module. This is available as a method on all chunked arrays. E.g.:

nsamp_alt = (nref > 0).sum(axis=1)

Hope that helps.

travc commented 8 years ago

Thanks. Using the proper sum function for chunked arrays is certainly the way to go. It would be nice if less-than-proper stuff just didn't work at all... not sure how that would be possible without changing numpy though. Sometimes I really wish Python had strict typing (of course when I'm writing C or C++ code, I feel otherwise).

One thing though... I see documentation for the allel.chunked.core.asum function, but none for sum.

travc commented 8 years ago

Quick follow-up question... How would I more properly do something like:

nsamp_ref = np.sum(np.logical_and(nref >= 0, nref < 2), axis=1)

when nref is a large chunked array? (I'm counting up the number of samples which have the reference allele (either hom or het) at each locus.)

alimanfoo commented 8 years ago

> Thanks. Using the proper sum function for chunked arrays is certainly the way to go. It would be nice if less-than-proper stuff just didn't work at all... not sure how that would be possible without changing numpy though. Sometimes I really wish Python had strict typing (of course when I'm writing C or C++ code, I feel otherwise).

Yeah, this is a weird case. Apparently when you call np.sum(), if the argument is not a numpy array but has a sum() method, then numpy will just call that method, assuming it is the right thing to do. In other cases numpy will call np.asarray() on the argument, which coerces it to a numpy array. I think it would be better if numpy always applied np.asarray() when it needs an array; that would then raise an error if the argument cannot be converted to an array.
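
To illustrate that dispatch behaviour (a toy sketch; HasSum is a made-up class, not part of numpy or scikit-allel):

import numpy as np

class HasSum:
    # np.sum() delegates to a duck-typed sum() method on non-ndarray arguments
    def sum(self, axis=None, dtype=None, out=None):
        return 42

print(np.sum(HasSum()))                  # -> 42, via the duck-typed method
print(np.sum([[1, 2], [3, 4]], axis=1))  # -> [3 7], via the np.asarray() coercion path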

> One thing though... I see documentation for the allel.chunked.core.asum function, but none for sum.

Sorry, yes it's "asum" to avoid clobbering the built-in sum. But the method on ChunkedArray is "sum".
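
Side by side, the two spellings look like this (a sketch assuming nref is the chunked array from earlier; the module path for asum is taken from the traceback above):

from allel.chunked.core import asum

nsamp_alt = asum(nref > 0, axis=1)  # module-level function, named asum to avoid clobbering sum
nsamp_alt = (nref > 0).sum(axis=1)  # equivalent method form on the chunked array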


alimanfoo commented 8 years ago

Regarding the question about np.sum(np.logical_and(nref >= 0, nref < 2), axis=1), there are a couple of options.

Sticking with the allel.chunked module, this isn't pretty but works:

import operator  # stdlib; operator.and_ is the function form of &

nsamp_ref = (nref >= 0).binary_op(operator.and_, nref < 2).sum(axis=1)

An alternative is to use dask.array, which has a much greater range of operations and should also execute more efficiently (less memory, possibly faster). Support for dask is new in scikit-allel and not heavily tested yet, so any feedback is very welcome. The API is slightly different: nothing is ever materialised as a numpy array unless you call compute(). For example:

import h5py
import allel

callset = h5py.File('/path/to/my/callset.h5', mode='r')
g = allel.GenotypeDaskArray.from_array(callset['3L/calldata/genotype'])
nref = g.to_n_ref()
d = ((nref >= 0) & (nref < 2)).sum(axis=1)  # still a dask array, nothing computed yet
nsamp_ref = d.compute()  # materialise as a numpy array

alimanfoo commented 8 years ago

Referencing #69, I'm going to add support for more binary operators on the ChunkedArray class, so (nref >= 0) & (nref < 2) will work. However, using dask.array is still a better general solution.
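
Once that lands, the chunked version of the earlier expression should read (a sketch of the intended syntax, per the comment above):

nsamp_ref = ((nref >= 0) & (nref < 2)).sum(axis=1)  # stays chunked throughout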

travc commented 8 years ago

Thanks. I'm not at all familiar with dask.array, but will give it a try. My current application is just complicated enough to be non-trivial. Hopefully others will find the tip useful.

travc commented 8 years ago

allel.models.dask doesn't get loaded (it silently fails) without the toolz module, which isn't in the list of prerequisites. This caused me a bit of confusion.
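
A quick way to surface the missing dependency up front (a hypothetical check based on the behaviour described here, not part of scikit-allel):

try:
    import toolz  # needed by dask; without it allel's dask models silently fail to load
except ImportError as e:
    raise ImportError("toolz is required for dask support, e.g. pip install toolz") from e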

travc commented 8 years ago

Using dask arrays is much slower than chunked arrays for my particular task at the moment (no clue why). The dataset isn't too huge, so just making nref into a numpy array works well enough.
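
Concretely, that conversion is just the [:] idiom from earlier in the thread (a one-line sketch, assuming g is the chunked genotype array):

nref = g.to_n_ref(fill=-1)[:]  # [:] materialises the chunked result as a plain numpy array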

alimanfoo commented 8 years ago

I keep meaning to raise an issue about this; now done: https://github.com/blaze/dask/issues/962


alimanfoo commented 8 years ago

Thanks for the feedback. I'm sure you're too busy right now to do any digging, but if you are running some heavy workloads in future and have a bit of time, it would be interesting to dig into the performance. Dask should be faster on larger data because it should manage to do some work in parallel, but I haven't used it much yet so don't have a lot of experience.


travc commented 8 years ago

Will do. It is most probable that I'm doing something in a particularly inefficient way.

What I'm doing is just filtering. More specifically, I'm building several boolean arrays based on allele_count and nref, and-ing them together, and applying the result with compress. Filtering successively would be more efficient, but I didn't see a simple way to do that without recomputing the counts. I think that sort of filtering is a common enough application that some examples, or even a little tutorial on doing it with dask arrays (and chunked arrays), might be generally useful. Maybe even some helper functions.

alimanfoo commented 8 years ago

Could you say in words what the filtering condition is? I.e., you want to retain variants where...


travc commented 8 years ago

My implementation works and isn't for 'production', so don't feel any obligation to spend time/effort on it... The application is creating an input file for a dispersal estimation program (which I've got working, but I'm still trying to figure out how, or whether, I can get useful information out of it).

I'm starting with a VCF (converted to hdf5) filtered for accessible sites and snpEff MODIFIER (also doing those filters in allel would be good for production work).

  1. Take a specific subset of samples.
  2. Keep only biallelic variants.
  3. Drop singletons.
  4. (The odd part) keep variants where the minor allele (could be ref or alt) occurs in exactly N samples... eventually I'll probably do that by collection site instead of sample, which will be more involved.

The code, based on the PCA and Fst examples from your blog:

    # load the callset
    callset = h5py.File(args.callset, mode='r')
    all_sample_ids = [ _.decode('utf-8') for _ in callset[list(callset.keys())[0]]['samples'] ]
    # convert the callset to a genotype array (@TCC single chrom version for now)
    g_all = allel.GenotypeChunkedArray(callset[chrom]['calldata']['genotype'])

    ## Subset by sample
    sample_idxs = samplenames2idxs(callset[chrom], sample_ids)
    print("sample_idxs:", sample_idxs)
    if not all(sample_idxs[i] <= sample_idxs[i+1] for i in range(len(sample_idxs)-1)):
        print("ERROR: sample_ids must be in same order as in the callset file!", file=sys.stderr)
    else:
        if sample_ids != all_sample_ids:  # reuse the list decoded above
            g = g_all.subset(None, sample_idxs)
        else:
            g = g_all  # requested samples are all samples; no subsetting needed

    ## Filter

    # allele counts at each variant
    ac = g.count_alleles()

    # allele type counts (just for curiosity)
    multi_allelic = np.count_nonzero(ac.max_allele() > 1)
    biallelic = np.count_nonzero(ac.max_allele() == 1)
    print('input sites:', g.n_variants)
    print('multi/biallelic:',multi_allelic,'/',biallelic,'=',(multi_allelic/biallelic))
    print('singletons:',np.count_nonzero((ac.max_allele() == 1) & (ac[:, :2].min(axis=1) <= 1)))

    # filtering (biallelic)
    flt1 = (ac.max_allele() == 1)
    flt1 = np.logical_and(flt1, ac[:, :2].min(axis=1) > 1)
    print('biallelic filter passes:',sum(flt1))

    # filter by number of *samples* which have a low-freq allele (either hom or het)
    # still using ac
    nref = g.to_n_ref(fill=-1)[:] # @TCC the '[:]' converts from ChunkedArray to numpy.array, possibly not needed
    nsamp_alt = (nref > 0).sum(axis=1)
    nsamp_ref = np.sum(np.logical_and(nref >= 0, nref < 2), axis=1)
    flt2 = np.logical_or(nsamp_ref == TARGET_OBSERVATION_COUNT, nsamp_alt == TARGET_OBSERVATION_COUNT)
    print('variants with', TARGET_OBSERVATION_COUNT,
          'samples observed to have the non-major allele:', sum(flt2))

    # combine the filters
    flt = np.logical_and(flt1, flt2)
    # apply filter (Don't forget to filter the positions/variants too!)
    print('combined filter passes:', sum(flt))
    gf = g.compress(flt, axis=0)
    pos_all = allel.SortedIndex(callset[chrom]['variants']['POS'][:])
    posf = pos_all.compress(flt)

    # should have same num of positions and rows in genotype array
    print(posf.shape, "should have same positions as", gf.shape)
alimanfoo commented 8 years ago

Thanks Travis, interesting use case. Here's one way to do it, using Ag1000G data: http://nbviewer.jupyter.org/gist/alimanfoo/12588417a27d1afcb9ba/travis_use_case.ipynb

alimanfoo commented 8 years ago

Resolved in PR #67.