Open schlunma opened 6 months ago
@SciTools/peloton Thanks @schlunma for this suggestion. We are curious about your use case. Could you share it with us, and explain why you require this additional Iris-specific functionality rather than just using numpy?
The numpy and dask versions will always collapse the entire array; there is no way of calculating the histogram along one or more specified axes.
However, this is exactly what I need for my particular use case: I want to calculate a metric called Earth mover's distance across different coordinates. The default numpy and dask histograms would only allow me to calculate that metric across the entire dataset. More details can be found in the corresponding ESMValCore PR, where you can also find working code.
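To make the limitation concrete, here is a minimal sketch (not the ESMValCore code). It shows that np.histogram flattens its input, then computes a histogram-based Earth mover's distance along a single axis. The helper names hist_along_axis and emd_along_axis are hypothetical, and the distance formula assumes both histograms share the same equal-width bins and are normalized to sum to 1; the implementation in the PR may well differ.

```python
import numpy as np


def hist_along_axis(data, bins, axis=-1):
    """Normalized histogram along one axis (hypothetical helper)."""
    def _hist(values):
        counts, _ = np.histogram(values, bins=bins)
        return counts / counts.sum()

    # The chosen axis is replaced by an axis of length len(bins) - 1.
    return np.apply_along_axis(_hist, axis, data)


def emd_along_axis(data1, data2, bins, axis=-1):
    """1D Earth mover's distance between histograms along one axis.

    Assumes equal-width bins shared by both inputs.
    """
    bin_width = bins[1] - bins[0]
    cdf1 = np.cumsum(hist_along_axis(data1, bins, axis), axis=axis)
    cdf2 = np.cumsum(hist_along_axis(data2, bins, axis), axis=axis)
    return np.abs(cdf1 - cdf2).sum(axis=axis) * bin_width


rng = np.random.default_rng(0)
obs = rng.normal(0.0, 1.0, size=(3, 1000))    # e.g. (location, time)
model = rng.normal(0.5, 1.0, size=(3, 1000))
bins = np.linspace(-5.0, 5.0, 41)

# np.histogram flattens its input: one histogram for the whole array.
flat_counts, _ = np.histogram(obs, bins=bins)
print(flat_counts.shape)  # (40,)

# One distance per location, computed along the time axis.
emd = emd_along_axis(obs, model, bins, axis=1)
print(emd.shape)  # (3,)
```

This handles neither masked nor lazy (dask) data, which is part of what the feature request asks for.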
There is an open issue in numpy about adding the axis keyword to histogram. Would it be worth trying to push that forward first?
https://github.com/numpy/numpy/issues/13166
Thanks for the link @rcomer! Especially the xhistogram package looks super relevant; unfortunately, it looks like it's not maintained anymore. Getting the function into numpy would not be enough for me, since I also need a dask version of it (and from the comments in the linked issue, it also seems that this is not trivial at all if one wants to do that properly). On the other hand, my current solution is quite simple and just relies on np.vectorize (which makes it slower, but the performance is ok).
I am also completely fine to include this into ESMValCore, so we can close this if this is not relevant for you.
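For readers unfamiliar with the np.vectorize approach mentioned above, the following is a rough sketch of how it could look for plain numpy arrays. It is not the code from the ESMValCore PR: the function name histogram_over_axes and its parameters are made up for illustration, and it ignores masked and lazy data entirely.

```python
import numpy as np


def histogram_over_axes(array, over_axes, bins):
    """Histogram counts over the given axes (hypothetical helper).

    The collapsed axes are replaced by a trailing axis of bin counts.
    """
    over_axes = tuple(int(ax) for ax in np.atleast_1d(over_axes))
    n_keep = array.ndim - len(over_axes)

    # Move the axes to be collapsed to the end and flatten them into one.
    moved = np.moveaxis(array, over_axes, list(range(n_keep, array.ndim)))
    flat = moved.reshape(moved.shape[:n_keep] + (-1,))

    def _counts(values):
        return np.histogram(values, bins=bins)[0]

    # With this signature np.vectorize loops over all leading axes in
    # Python, which is simple but slower than a native implementation.
    return np.vectorize(_counts, signature="(n)->(m)")(flat)


data = np.arange(24, dtype=float).reshape(2, 3, 4)
bins = np.array([0.0, 12.0, 24.0])

counts = histogram_over_axes(data, (1, 2), bins)
print(counts)
# [[12  0]
#  [ 0 12]]
```

Because np.vectorize is essentially a Python-level loop, the cost grows with the number of retained elements rather than with the length of the collapsed axes.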
I reckon further discussion first, now we have a more detailed use case.
For me the key question here is: what is the point of making this a function of a Cube, rather than just an operation on an array, calc(array, over_axes=None, n_bins=10, bins=None)?
It could be that the coords add some validity to the operation, or that a Cube with a 'value_bins' dimension is itself useful. Perhaps iris.plot has a role. But so far I haven't got the killer need: why isn't this just a piece of maths?
I don't have the killer argument for this; I guess it's just nicer to have this work with labeled dimensions instead of axes and include proper metadata handling. For my specific use case, it would also be totally fine to have this work with arrays.
However, your argumentation could also be applied to most mathematical operations in iris, right? For example, why do you have cube.collapsed(coords, iris.analysis.MEAN) when you could do array.mean(axis=...)? Why do you support cube1 + cube2 when you could simply do array1 + array2?
Totally, it's a judgement thing. But to your specific examples, statistics and arithmetic do both contain useful metadata handling, to modify cell-methods and units.
In this case, I guess the result cube would always have a count or frequency identity, so probably a long-name and units of '1'. AFAICT there aren't really any useful CF concepts that we could apply here, though. I guess we'd like to be able to have a cube like "frequency of air_temperature" with units like "frequency", "fraction" or "count", but such things are currently out-of-scope: there isn't even a standard "extension" attribute for non-standard units, in the way that 'long_name' is so often used. Likewise, a cell-method might make sense to describe the dimensions over which the operation was applied. But again, it would need an extension to the standardised forms, e.g. "histogram over time".
✨ Feature Request
I am currently working on an ESMValTool preprocessor that calculates histograms from cubes along given coordinates similar to np.histogram. I think this would also be a nice fit to iris in the iris.analysis.stats module. Here is a possible call signature:
This function should fully support lazy and/or masked data. If this is considered relevant for iris, I can open a PR (already have some code for this).
Motivation
Calculating histograms is a common task in geosciences.