JuliaPlots / StatsPlots.jl

Statistical plotting recipes for Plots.jl
Other
437 stars 88 forks source link

[BUG] Normalization not working for `groupedhist` #457

Closed ValentinKaisermayer closed 2 years ago

ValentinKaisermayer commented 3 years ago

If groups are used the bins are no longer normalized:

StatsPlots.groupedhist(
    [randn(1000) randn(1000) .+ 2.0], 
    group=repeat(["1", "2"], inner=1000), 
    bar_position=:doge, 
    normalize=true,
)

bug

ValentinKaisermayer commented 3 years ago

https://github.com/JuliaPlots/StatsPlots.jl/blob/b99a4ff6c760f0e1275bb54572c94da03a7bc81f/src/hist.jl#L100 shouldn't it be:

y[:,i] = normed ? h_i.weights / sum(h_i.weights) : h_i.weights

Which would yield: bug

sethaxen commented 3 years ago

Thanks for the issue! First, this is definitely a bug, but the fix depends on what normed should mean here. The fix you have proposed would cause the area of each bin to correspond to p(bin=i | group=j) (*inv(nbins) when bar_position=:dodge), but suppose if group 1 has 100 points total, 50 of which are in bin A, while group 2 has only 1 point, which also falls in group A.

julia> x = [1; ones(50); fill(2, 50)];

julia> group=[1; fill(2, 100)];

julia> groupedhist(x; group=group, normed=false, title="normed=false")

normed_false

julia> groupedhist(x; group=group, normed=true, title="normed=true")

normed_true

This normalization causes group B to have the highest peak, which is misleading, given that it is not only the least abundant group in bin 1, it is also the least abundant group total. The problem is that the normalization hasn't taken into account the imbalance between the groups. i.e. we probably want the area of each bin to correspond to p(bin=i | group=j) p(group=j), which has the added benefit that when one passes bar_position=:stack, then the heights of the stacks should correspond exactly to what one would get with histogram with normed=true.

ValentinKaisermayer commented 3 years ago

Matplotlib seems to do this a bit differently. They have a different normalization, if the bins are stacked or not. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html In both cases the area under the curve evaluates to one. image

ValentinKaisermayer commented 2 years ago

@sethaxen Is this what you had in mind?

StatsPlots.groupedhist(
    [1; ones(50); fill(2, 50)], 
    group=[1; fill(2, 100)], 
    bar_position=:doge, 
    normalize=true,
)

hist

StatsPlots.groupedhist(
    [1; ones(50); fill(2, 50)], 
    group=[1; fill(2, 100)], 
    bar_position=:doge, 
    normalize=false,
)

hist

ValentinKaisermayer commented 2 years ago
StatsPlots.groupedhist(
    [1; ones(50); fill(2, 50)], 
    group=[1; fill(2, 100)], 
    bar_position=:stack, 
    normalize=true,
)

hist

StatsPlots.histogram(
    [1; ones(50); fill(2, 50)], 
    normalize=:probability,
)

hist

sethaxen commented 2 years ago

@ValentinKaisermayer Yes, this looks perfect! I do think that for bar_position=:dodge, the bar heights should be multiplied by the number of groups to ensure that the areas are the same as for bar_position=:stack, but this would be a functional change and should be handled in a different PR.

ValentinKaisermayer commented 2 years ago

to ensure that the areas are the same

I think this would then be the same behavior as in matplotlib.