Closed ValentinKaisermayer closed 2 years ago
https://github.com/JuliaPlots/StatsPlots.jl/blob/b99a4ff6c760f0e1275bb54572c94da03a7bc81f/src/hist.jl#L100 shouldn't it be:
y[:,i] = normed ? h_i.weights / sum(h_i.weights) : h_i.weights
Which would yield:
Thanks for the issue! First, this is definitely a bug, but the fix depends on what normed
should mean here. The fix you have proposed would cause the area of each bin to correspond to p(bin=i | group=j)
(*inv(nbins)
when bar_position=:dodge
), but suppose if group 1 has 100 points total, 50 of which are in bin A, while group 2 has only 1 point, which also falls in group A.
julia> x = [1; ones(50); fill(2, 50)];
julia> group=[1; fill(2, 100)];
julia> groupedhist(x; group=group, normed=false, title="normed=false")
julia> groupedhist(x; group=group, normed=true, title="normed=true")
This normalization causes group B to have the highest peak, which is misleading, given that it is not only the least abundant group in bin 1, it is also the least abundant group total. The problem is that the normalization hasn't taken into account the imbalance between the groups. i.e. we probably want the area of each bin to correspond to p(bin=i | group=j) p(group=j)
, which has the added benefit that when one passes bar_position=:stack
, then the heights of the stacks should correspond exactly to what one would get with histogram
with normed=true
.
Matplotlib seems to do this a bit differently. They have a different normalization, if the bins are stacked or not. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html In both cases the area under the curve evaluates to one.
@sethaxen Is this what you had in mind?
StatsPlots.groupedhist(
[1; ones(50); fill(2, 50)],
group=[1; fill(2, 100)],
bar_position=:doge,
normalize=true,
)
StatsPlots.groupedhist(
[1; ones(50); fill(2, 50)],
group=[1; fill(2, 100)],
bar_position=:doge,
normalize=false,
)
StatsPlots.groupedhist(
[1; ones(50); fill(2, 50)],
group=[1; fill(2, 100)],
bar_position=:stack,
normalize=true,
)
StatsPlots.histogram(
[1; ones(50); fill(2, 50)],
normalize=:probability,
)
@ValentinKaisermayer Yes, this looks perfect! I do think that for bar_position=:dodge
, the bar heights should be multiplied by the number of groups to ensure that the areas are the same as for bar_position=:stack
, but this would be a functional change and should be handled in a different PR.
to ensure that the areas are the same
I think this would then be the same behavior as in matplotlib.
If
group
s are used the bins are no longer normalized: