JuliaPlots / StatsPlots.jl

Statistical plotting recipes for Plots.jl
Other
437 stars 88 forks source link

ECDF difference plot with confidence band #452

Open sethaxen opened 3 years ago

sethaxen commented 3 years ago

Sometimes one has an ECDF or random sample and wants to compare it to the CDF of some known distribution. One could plot the ECDF of the sample and the CDF of the distribution and compare them, but they will always deviate, and it's difficult to tell if the difference is meaningful.

There are two ways to handle this: plotting the ECDF difference, and/or plotting a confidence band.

If x is exactly sampled from the distribution d, then the marginal (i.e. pointwise) distribution of the ECDF at each observed x value has a known form ecdf_marginal(d, N, x) = Binomial(N, cdf(d, x)) / N, where N is the size of the sample. One can use this to plot a confidence band that gives a sense for what deviation from d should be expected given the sample size N. An example:

using Distributions, StatsBase, StatsPlots, Random

rng = MersenneTwister(87)
plot()

N = 100
L = 100
α = 1 - 0.99
ecdf_marginal(d, N, x) = Binomial(N, cdf(d, x)) / N
dexp = Uniform(0, 1)

# plot uniform ecdf
euni = ecdf(rand(rng, dexp, N))
xuni = euni.sorted_values
plot!(xuni, euni(xuni) .- cdf.(dexp, xuni); color=:black, seriestype=:steppost, label="Uniform(0, 1)")

# plot Beta(1, 1.05) ecdf
ebeta = ecdf(rand(rng, Beta(1, 1.05), N))
xbeta = ebeta.sorted_values
plot!(xbeta, ebeta(xbeta) .- cdf.(dexp, xbeta); color=:red, seriestype=:steppost, label="Beta(1, 1.05)")
plot!(legend = :outertop)

# plot envelope
xall = unique!(sort!(vcat(xuni, xbeta)))
lb = quantile.(ecdf_marginal.(dexp, N, xall), α / 2)
ub = quantile.(ecdf_marginal.(dexp, N, xall), 1 - α / 2)
expected = cdf.(dexp, xall)
plot!(xall, lb .- expected; fillrange = ub .- expected, color=:lightgray, seriestype=:steppost, alpha=0.25, label="$(1-α)% confidence band")
plot!(xall, zero(expected); color=:gray, alpha=0.5, primary=false)

tmp

I propose that this be callable with an interface like

ecdfplot(xuni; refdist = Uniform(0, 1), difference=true, band_prob=0.95)
ecdfplot!(xbeta; refdist = Uniform(0, 1), difference=true)

It's worth noting that because the confidence interval is pointwise, the envelope should not be expected to contain the entire ECDF 95% of the time; it should in fact be less than that. It might be worth it to have an option to generate simultaneous confidence bands, which we could probably do by adapting the simulation-based method in (ref). These are more easily interpretable, as they should contain the entire ECDF curve 95% of the time.

sethaxen commented 3 years ago

Related issue for bayesplot: https://github.com/stan-dev/bayesplot/issues/267