JuliaStats / StatsBase.jl

Basic statistics for Julia
Other
584 stars 191 forks source link

Using intervals to represent histogram binning #514

Open oschulz opened 5 years ago

oschulz commented 5 years ago

Since I'll have to touch the structure of Histogram for 513 anyhow - there's something I've been thinking a bit for a while, and it might be easiest to do this in one go:

I've always been a bit unhappy about the way histogram edges are handled - the fact that edge vectors are one entry longer than the weights each axis always struck me as somewhat inelegant. It also makes histogram analysis code a bit awkward at times.

I think the natural representations of bins are intervals. So the binning of an edge of a histogram should be a vector of (usually contiguous) intervals. Internally, it should in general still be stored the current way, as a vector of real numbers (length + 1), but it should look like a vector of intervals. So I'd like to propose the following:

Operations like getting all left/right bin edges or all bin centers would simply become broadcasts of minimum/maximum/mean over the binning intervals, etc. Stuff like plotting and histogram analysis code would become simpler and more elegant in general.

I'd like to implement this, but as it's not a trivial change, I'd be glad to get a provisional Ok, before I go ahead (@nalimilan @ararslan @quinnj?).

andreasnoack commented 5 years ago

To me, this seems like complicating the code with very limited practical benefit. Introducing new types always has some overhead and I don't really see the benefit in this case. In particular, I question that allowing flexibility wrt which endpoints are included is worth the trouble. The underlying code would have to take this into account. Histograms are for continuous distributions so shouldn't really matter. Do you have applications where it would be useful?

oschulz commented 5 years ago

Do you have applications where it would be useful?

Sure - we frequently do detailed analysis of histograms. It would be much more natural if StatsBase.Histogram had a representation of a bin on an axis, instead of the current implicit (resp by-convention) way of saying "take axis entries i and i+1". It's kinda hacky, the current way, as a user-facing API. I think intervals would be much cleaner and more natural, this wouldn't come with any runtime/performance cost and I think I could do this without complicating the code overall. Actually, except for the definition of a ContiguousIntervals (that could like in IntervalSets if @timholy agrees), this might actually simplify parts of the Histogram code.

andreasnoack commented 5 years ago

I'm well aware that you do a lot of work with histograms so my questions was more if you could give an example that could help me understand a situation where it would be easier to use intervals. I'd mainly like to understand where the new types would be useful and I'd like to avoid that users would have to bother with :open and :closed.

oschulz commented 5 years ago

if you could give an example that could help me understand a situation where it would be easier to use intervals

For example, when scanning a histogram for certain patterns (peaks, etc.), one often needs to take left and right bound of each bin (or it's width and center) into account. So you'll have code that looks at each bin and it's weight - but currently, we don't have a one-to-one relationship between weight-indices and bin-indices, the user has to reconstruct the bin information in an i and i+1 fashion. And when checking for bounds (e.g before using @inbounds) one needs to check that the one index-axis must be exactly one entry longer than the other. This is all kinda inconvenient and inelegant.

I think when dealing with histograms, we should be able to have a representation of a bin on an axis - and the natural representation is IMHO an interval. One could even broadcast over bins and weights and stuff like that.

I'd mainly like to understand where the new types would be useful

Well, Interval wouldn't be a new type since we have a well-established type for that in IntervalSets, which would be a very lightweight dependency (doesn't pull in any non-stdlib packages, package load time 0.05s). ContiguousIntervals would be a new type - but it would simply do what we do now in functions like _edge_binvolume or in various places in the histogram recipes in Plots.jl. So it would help make histogram codes DRYer across packages.

I'd like to avoid that users would have to bother with :open and :closed.

I don't think users would have to bother with that at all - they would create and use histograms as they do now. The only API-change would be that long-term we'd deprecate accessing hist.edges, which would be replaced by hist.binning. hist.binning[1][n] would simply provide the n-th bin on the first axis, as an Interval - but I don't think many users would bother with the type parameters of that Interval object, except in cases where they're really interested in whether it's closed left or right. In such cases, however, the user would now be able to dispatch on this if necessary.

simonbyrne commented 5 years ago

It would be interesting to try this to see how it works: it would be nice to if the weights and intervals were of the same length.

Would be good to see if this could be build on top of https://github.com/JuliaMath/IntervalSets.jl, which was intended to be a common "interval" package (though I assume more functionality would be required than what is available there).

oschulz commented 5 years ago

It would be interesting to try this to see how it works

Ok, in that case I will at least do a prototype, and then try to convince @andreasnoack based on that. :-)

Will take a bit of time, have lot's of stuff (with deadlines) going on at me moment.

Would be good to see if this could be build on top of https://github.com/JuliaMath/IntervalSets.jl ... (though I assume more functionality would be required than what is available there)

Yes, that's exactly what I have in mind.

oschulz commented 4 years ago

Update: Not abandoned, just deferred due to time constraints. I still definitely plan to pursue this.

oschulz commented 4 years ago

Just a life sign - I'm still interested in this, just overloaded, but it's definitely on my to-do list.

oschulz commented 4 years ago

Haven't forgotten about this.

oschulz commented 3 years ago

Still not forgotten, just overloaded ...

oschulz commented 3 years ago

Sorry for the long silence on this - I've been thinking a bit: Now that view-like wrapper objects don't have to be stack-allocated anymore, we can just add a method binning(hist) or getbinning(hist) that returns the binning as a vector of intervals, wrapping the edges without copying anything, even if the edges are actual vectors and not ranges. And we could add ctors that accept vectors-of-intervals as binning, in addition to edge vectors. So we wouldn't need to change the structure of Histogram right now (but might, longer term, also for #513).

See also #650.