Open oschulz opened 5 years ago
To me, this seems like complicating the code with very limited practical benefit. Introducing new types always has some overhead and I don't really see the benefit in this case. In particular, I question that allowing flexibility wrt which endpoints are included is worth the trouble. The underlying code would have to take this into account. Histograms are for continuous distributions so shouldn't really matter. Do you have applications where it would be useful?
Do you have applications where it would be useful?
Sure - we frequently do detailed analysis of histograms. It would be much more natural if StatsBase.Histogram
had a representation of a bin on an axis, instead of the current implicit (resp by-convention) way of saying "take axis entries i and i+1". It's kinda hacky, the current way, as a user-facing API. I think intervals would be much cleaner and more natural, this wouldn't come with any runtime/performance cost and I think I could do this without complicating the code overall. Actually, except for the definition of a ContiguousIntervals
(that could like in IntervalSets
if @timholy agrees), this might actually simplify parts of the Histogram
code.
I'm well aware that you do a lot of work with histograms so my questions was more if you could give an example that could help me understand a situation where it would be easier to use intervals. I'd mainly like to understand where the new types would be useful and I'd like to avoid that users would have to bother with :open
and :closed
.
if you could give an example that could help me understand a situation where it would be easier to use intervals
For example, when scanning a histogram for certain patterns (peaks, etc.), one often needs to take left and right bound of each bin (or it's width and center) into account. So you'll have code that looks at each bin and it's weight - but currently, we don't have a one-to-one relationship between weight-indices and bin-indices, the user has to reconstruct the bin information in an i
and i+1
fashion. And when checking for bounds (e.g before using @inbounds
) one needs to check that the one index-axis must be exactly one entry longer than the other. This is all kinda inconvenient and inelegant.
I think when dealing with histograms, we should be able to have a representation of a bin on an axis - and the natural representation is IMHO an interval. One could even broadcast over bins and weights and stuff like that.
I'd mainly like to understand where the new types would be useful
Well, Interval
wouldn't be a new type since we have a well-established type for that in IntervalSets
, which would be a very lightweight dependency (doesn't pull in any non-stdlib packages, package load time 0.05s). ContiguousIntervals
would be a new type - but it would simply do what we do now in functions like _edge_binvolume
or in various places in the histogram recipes in Plots.jl. So it would help make histogram codes DRYer across packages.
I'd like to avoid that users would have to bother with :open and :closed.
I don't think users would have to bother with that at all - they would create and use histograms as they do now. The only API-change would be that long-term we'd deprecate accessing hist.edges
, which would be replaced by hist.binning
. hist.binning[1][n]
would simply provide the n-th bin on the first axis, as an Interval
- but I don't think many users would bother with the type parameters of that Interval
object, except in cases where they're really interested in whether it's closed left or right. In such cases, however, the user would now be able to dispatch on this if necessary.
It would be interesting to try this to see how it works: it would be nice to if the weights
and intervals
were of the same length.
Would be good to see if this could be build on top of https://github.com/JuliaMath/IntervalSets.jl, which was intended to be a common "interval" package (though I assume more functionality would be required than what is available there).
It would be interesting to try this to see how it works
Ok, in that case I will at least do a prototype, and then try to convince @andreasnoack based on that. :-)
Will take a bit of time, have lot's of stuff (with deadlines) going on at me moment.
Would be good to see if this could be build on top of https://github.com/JuliaMath/IntervalSets.jl ... (though I assume more functionality would be required than what is available there)
Yes, that's exactly what I have in mind.
Update: Not abandoned, just deferred due to time constraints. I still definitely plan to pursue this.
Just a life sign - I'm still interested in this, just overloaded, but it's definitely on my to-do list.
Haven't forgotten about this.
Still not forgotten, just overloaded ...
Sorry for the long silence on this - I've been thinking a bit: Now that view-like wrapper objects don't have to be stack-allocated anymore, we can just add a method binning(hist)
or getbinning(hist)
that returns the binning as a vector of intervals, wrapping the edges without copying anything, even if the edges are actual vectors and not ranges. And we could add ctors that accept vectors-of-intervals as binning, in addition to edge vectors. So we wouldn't need to change the structure of Histogram
right now (but might, longer term, also for #513).
See also #650.
Since I'll have to touch the structure of
Histogram
for 513 anyhow - there's something I've been thinking a bit for a while, and it might be easiest to do this in one go:I've always been a bit unhappy about the way histogram edges are handled - the fact that edge vectors are one entry longer than the weights each axis always struck me as somewhat inelegant. It also makes histogram analysis code a bit awkward at times.
I think the natural representations of bins are intervals. So the binning of an edge of a histogram should be a vector of (usually contiguous) intervals. Internally, it should in general still be stored the current way, as a vector of real numbers (length + 1), but it should look like a vector of intervals. So I'd like to propose the following:
A new type
ContiguousIntervals
that implements a view of anAbstractVector{T}
with length L + 1 as a vector ofInterval{:closed,:open,T}
(resp.Interval{:open,:closed,T}
) with length L. Maybe this could be contibuted toIntervalSets
(@timholy, do you think that would fit in)?Histogram
's fieldedges::E
gets replaced bybinning::E
withE<:NTuple{N,AbstractVector{<:IntervalSets.Interval}}
. By default, we'd useContiguousIntervals
, but we'd allow any vector of intervals.We use
Base.getproperty
and friends to provide a virtual propertyedges
, for backward compatibility.Operations like getting all left/right bin edges or all bin centers would simply become broadcasts of
minimum
/maximum
/mean
over the binning intervals, etc. Stuff like plotting and histogram analysis code would become simpler and more elegant in general.I'd like to implement this, but as it's not a trivial change, I'd be glad to get a provisional Ok, before I go ahead (@nalimilan @ararslan @quinnj?).