SciTools / biggus

:no_entry: [DEPRECATED] Virtual large arrays and lazy evaluation.
http://biggus.readthedocs.io/
GNU Lesser General Public License v3.0
54 stars 27 forks source link

Add functionality to create a lazy masked array where equal to a given value #181

Closed djkirkham closed 8 years ago

djkirkham commented 8 years ago

This addition is designed to mimic the NumPy function numpy.ma.masked_equal()

marqh commented 8 years ago

this looks like a useful and well coded change

does anyone have objections to me merging this? @bjlittle @pelson @pp-mo

pp-mo commented 8 years ago

thanks @marqh

I don't know enough to be sure about this, but with an eye to future addtions I'm wondering if a two-step solution would be more flexible : That is, provide a two-argument constructor that combines a data-array with a mask-array, treating them both as independent biggus calculations. Then we could define the required function as something like :

def masked_equal(array, value):
    return masked_array(array, equal(array, value))

Where this masked_array is a biggus equivalent of numpy.ma.array(data, mask=mask). It is probably a new array class, something like 'DataAndMask(_Elementwise)', similar to the '_ufunc_wrapper' derivations.

The actual implementation would be basically much as already given in @djkirkham existing proposal. However, I'm suggesting the calculation is purely elementwise, so to do what we want here (compare with a constant) it uses broadcasting -- but I think that comes for free with the existing implementation.

The benefit is, it's then dead simple to provide similar new operators such as masked_less. Assuming we get around to needing such additions, this way could create a lot less clutter. In direct usage, it can also handle much more complex cases, like masked_array(a, mask=(b < c * 0.5)).

Does anyone think that is a good idea ?

djkirkham commented 8 years ago

It seems to me that one problem with a general masked array class is that if the mask it based on the underlying array then when the data is loaded it must be loaded twice - once for the data and once for the mask. Is there a way around it other than implementing special cases like this one? Is it even a serious problem?

pp-mo commented 8 years ago

@djkirkham when the data is loaded it must be loaded twice

Ooh, good point !

I tend to forget that biggus is not like a language with lazy evaluation : they tend to avoid re-calculation. I think it was originally thought we would eventually address that in the 'engines' code, but it has never seemed much of a priority.

I do know we haven't bothered to avoid some duplicate evaluations in the StaGE configurations, as it was thought a minor problem : But I think that is mostly for orographies, where they are smaller (lower-dimensional) than the other data they are combined with.

I'm not sure what I think about the practicality of this now - it depends on intended usage.

marqh commented 8 years ago

i'm looking to cut a biggus release today.

On that basis, this either goes in as is, or it waits for further implementation.

My view is that this is useful, and that if a more flexible implementation arrives later, then this can become a call to that whilst maintaining the API call. On that Basis, my vote is for merge

@pp-mo @djkirkham please may you indicate whether you agree or disagree with me on this? thank you mark

pp-mo commented 8 years ago

@marqh agree or disagree

!agree! :+1:

djkirkham commented 8 years ago

@marq Sounds sensible, I agree.

pelson commented 8 years ago

I'm concerned about the results of this change:

In [3]: import biggus

In [4]: import numpy as np

In [5]: a = np.arange(10) % 2

In [6]: print(a)
[0 1 0 1 0 1 0 1 0 1]

In [7]: n_m = np.ma.masked_equal(a, 1)

In [8]: b_m = biggus.masked_equal(a, 1)

In [9]: print(n_m.mean(), biggus.mean(b_m, axis=0).ndarray())
(0.0, array(0.5))

Is there a reason you didn't implement ndarray @djkirkham?

djkirkham commented 8 years ago

Good spot, but I don't think it's a bug introduced by this change. Rather, it's a problem with biggus.mean(). The result is the same if line 8 is replaced with b_m = biggus.NumpyArrayAdapter(n_m)