JuliaStats / StatsBase.jl

Basic statistics for Julia
Other
577 stars 190 forks source link

fit nbis=2 produces results of nbins=3 #410

Open kakila opened 5 years ago

kakila commented 5 years ago

When fitting a histogram to data with nbins=2 I get the results corresponding to 3 bins. Same happens if I pass 0.0:0.5:1.5 and closed=:left instead. I get 3 bins.

julia> h = fit(Histogram, [0,1,0,0,1,1,1,1], nbins=2, closed=:right)
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -0.5:0.5:1.0
weights: [3, 0, 5]
closed: right
isdensity: false

julia> h = fit(Histogram, [0,1,0,0,1,1,1,1], nbins=3, closed=:right)
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -0.5:0.5:1.0
weights: [3, 0, 5]
closed: right
isdensity: false

I guess this is a bug, otherwise how to avoid this issue?

nalimilan commented 5 years ago

I'm not familiar with the histogram code, but AFAICT that's expected, and I get the same thing with R. The number of bins is only a suggestion, and it's not always possible to achieve the requested number. Maybe the documentation needs to be more explicit (it already says "approximate number of bins").

kakila commented 5 years ago

Ok, I do not how R handles the issue.

Then, to specify a trivial 2 bin histogram, Is one forced to do something like the following?

h = fit(Histogram, [0,1,0,0,1,1,1,1], -0.02:0.51:1.0, closed=:right)

Why can the function not resolve nbins=2 in this case to any valid edges (the above just one out of an infinite)?

jamblejoe commented 4 years ago

Just to revive an old issue or "feature": Having a function, guessing how many bins to chose if one do not provide anything, is useful. BUT if I do provide the number of bins, I want to be sure that I exactly get that number of bins in my histogram. What is the point of not having this behaviour?

Moelf commented 3 years ago

I think this can lead to unnecessary grief for the users: imagine two parties trying to sync bin counts... cc @oschulz for perspective

I think we can make nbins=X strictly enforced since in real life people are probably counting on nbins being very closely realized anyways.

jamblejoe commented 3 years ago

@Moelf , could you be a bit pro precise on your example? I dont understand how two parties are involved in plotting a histogram. Could give an example?

Moelf commented 3 years ago

it's common in science/data science people who processed data independently want to verify the result are the same (sync).

In this case, a naive understanding of nbins would lead to n bins (for example, if someone compares to numpy.histogram)

jamblejoe commented 3 years ago

@Moelf , yes sure! I misunderstood your comment. You suggest nbins specifying the exact number of bins.

oschulz commented 3 years ago

I think we can make nbins=X strictly enforced

I fully agree: if the user specifies nbins = 2, then they should get two bins. Getting 3 bins is surprising to the user, and a histogram with 2 bins (and even one bin with nbins = 1) is a perfectly valid thing.

nalimilan commented 3 years ago

At the very least it would be interesting to start providing an argument to force using exactly the requested number of bins. That wouldn't be breaking. Then in the next breaking release we could change the default if we're happy with how the new argument works.

thompsonmj commented 2 years ago

It seems to be convention to make 'nice' numbers for bin widths. I like the suggestion by @nalimilan to add a new argument like nbins_exact. This could be implemented to match functionality between

h = fit(Histogram, x, range(minimum(x), stop=maximum(x), length=29)) and h = fit(Histogram, x, nbins_exact=28)

oschulz commented 2 years ago

Hm, having two competing nbins kwargs makes things ambiguous, though. Maybe some kind of annotation like nbins = (100, :exact) or so instead?

Moelf commented 2 years ago

I believe the change is not breaking, reason: we said it's approximate anyways, which means

oschulz commented 2 years ago

I think the cleanest solution would be to have binning algorithms, represented by types/structs that can then have parameters like exact. @mkborregaard and me did add a several binning algorithms to Plots a long time ago (https://github.com/JuliaPlots/Plots.jl/blob/fd46fd4addd040c4e5b73b00e71109a46eea620e/src/recipes.jl#L789), but specified as keywords. This had always been intended to be an interim solution until we added them to StatsBase.Histogram in a proper way (with types), we just never got around to it. Maybe it's really time now.

Moelf commented 2 years ago

I actually have a histogram package because StatsBase seems reluctant to support more advanced usage such as:

hist1 / hist2

# thread-safe
push!(hist, val, weight)
oschulz commented 2 years ago

It would be really nice to support the thread-safe push! directly in StatsBase.Histogram

Moelf commented 2 years ago

Yes, but I think the idea is this is "Base" histogram. Well, I'm happy to be part of StatsBase for sure since I'm already doing (<: AbstractHistogram) but I don't know what the devs here think.

thompsonmj commented 2 years ago

More advanced usage can maybe go in a new issue ... I just want my bin counts 🙃

nalimilan commented 2 years ago

See also https://github.com/JuliaStats/StatsBase.jl/pull/533.

I actually have a histogram package because StatsBase seems reluctant to support more advanced usage such as:

hist1 / hist2

# thread-safe
push!(hist, val, weight)

Who said this? Actually at https://github.com/JuliaStats/StatsBase.jl/issues/650 I said almost the contrary. Improvements to histograms in StatsBase would be welcome (though I admit finding reviewers isn't always easy).

mkborregaard commented 2 years ago

Oh wow @oschulz have those methods not been moved yet? Now is clearly the time :-)

oschulz commented 2 years ago

You're so right @mkborregaard . We should do it dispatched-based, with binning algorithms structs (so they can, in principle, take parameters), not with keyword-symbols like in Plots. And it could tie in with #514 (still on my to-do list).

feanor12 commented 2 years ago

If the number of bins is approximated it would be good to document how it is approximated. Even better would be the possibility to select the approximation algorithm. This could then also have the option to get the exact number of bins.

A few options I found are:

oschulz commented 2 years ago

Yes, that's the kind of methods we have in Plot now and that we should move over to StatsBase (with algorithm structs, not keywords like in Plots).

thompsonmj commented 2 years ago

Gentle reminder if schedules permit:

@oschulz ...

I think the cleanest solution would be to have binning algorithms, represented by types/structs that can then have parameters like exact. @mkborregaard and me did add a several binning algorithms to Plots a long time ago (https://github.com/JuliaPlots/Plots.jl/blob/fd46fd4addd040c4e5b73b00e71109a46eea620e/src/recipes.jl#L789), but specified as keywords. This had always been intended to be an interim solution until we added them to StatsBase.Histogram in a proper way (with types), we just never got around to it. Maybe it's really time now.

@mkborregaard ...

Oh wow @oschulz have those methods not been moved yet? Now is clearly the time :-)

@oschulz ...

You're so right @mkborregaard . We should do it dispatched-based, with binning algorithms structs (so they can, in principle, take parameters), not with keyword-symbols like in Plots. And it could tie in with #514 (still on my to-do list).

oschulz commented 2 years ago

Thanks for the reminder! It's been on the back of my mind, I'll try to get on it soonish.

huixinzhang commented 2 years ago

This code works now: h = fit(Histogram, x, range(minimum(x), stop=maximum(x), length=nbins)).

thompsonmj commented 2 years ago

Hi @huixinzhang, I think the core of the discussion centers on the default response to using the nbins keyword argument. The workaround using range() is useful, but I believe nbins should use the exact number of bins asked for if the code is to speak for itself!