TuringLang / ParetoSmooth.jl

An implementation of PSIS algorithms in Julia.
http://turinglang.org/ParetoSmooth.jl/
MIT License

Uninformative Error Message when Sample Size is 1 #67

Open nsgrantham opened 2 years ago

nsgrantham commented 2 years ago

I am encountering an issue where, if I pass a Turing model of type DynamicPPL.Model to psis_loo(), it returns values of -Inf or NaN.

However, when I pass it a log-likelihood function and the data as an array of tuples, it behaves "correctly" (in the sense that it returns finite values, but I have not independently verified that they are accurate).

The following is a reproducible example.

using Distributions
using DataFrames
using Turing
using ParetoSmooth

function generate_persons(; n_years=1000, max_age=65, n_births=20, age_of_marriage=18)
    # https://github.com/rmcelreath/rethinking/blob/master/R/sim_happiness.R

    invlogit(x) = 1 / (1 + exp(-x))
    age = fill(1, n_births)
    married = fill(false, n_births)
    happiness = collect(range(-2, 2, length=n_births))

    for _ in 1:n_years
        age .+= 1
        append!(age, fill(1, n_births))
        append!(married, fill(false, n_births))
        append!(happiness, collect(range(-2, 2, length=n_births)))

        for i in eachindex(age, married, happiness)
            if age[i] >= age_of_marriage && !married[i]
                married[i] = rand(Bernoulli(invlogit(happiness[i] - 4)))
            end
        end

        deaths = findall(age .> max_age)
        deleteat!(age, deaths)
        deleteat!(married, deaths)
        deleteat!(happiness, deaths)
    end

    DataFrame(
        age = age,
        married = convert.(Int, married),
        happiness = happiness
    )
end

persons = generate_persons()
adults = persons[persons.age .> 17, :]
min_age, max_age = extrema(adults.age)
adults.age = (adults.age .- min_age) ./ (max_age - min_age)

@model function adult_happiness(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    μ = α .+ β .* a
    h ~ MvNormal(μ, σ)
end

model = adult_happiness(adults.happiness, adults.age)
post = sample(model, NUTS(), 10_000)

Now, reading through the code in src/TuringHelpers.jl in this repo, it seems as though there's a method that supports Turing models directly. However, if I run

psis_loo(model, post)

then I get

┌ Warning: Some Pareto k values are extremely high (>1). PSIS will not produce consistent estimates.
└ @ ParetoSmooth ~/.julia/packages/ParetoSmooth/agZDP/src/InternalHelpers.jl:43
Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 1 data points. Total Monte Carlo SE of NaN.
┌───────────┬───────┬──────────┬───────┬─────────┐
│           │ total │ se_total │  mean │ se_mean │
├───────────┼───────┼──────────┼───────┼─────────┤
│   cv_elpd │  -Inf │      NaN │  -Inf │     NaN │
│ naive_lpd │  -Inf │      NaN │  -Inf │     NaN │
│     p_eff │   NaN │      NaN │   NaN │     NaN │
└───────────┴───────┴──────────┴───────┴─────────┘

Instead, I can define the appropriate log-likelihood function and provide the data in an array of tuples:

function ll_fun(α, β, σ, data)
    h, a = data
    logpdf(Normal(α + β * a, σ), h)
end

psis_loo(ll_fun, post, collect(zip(adults.happiness, adults.age)))

And this yields something that seems correct.

[ Info: No source provided for samples; variables are assumed to be from a Markov Chain. If the samples are independent, specify this with keyword argument `source=:other`.
Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 960 data points. Total Monte Carlo SE of 0.019.
┌───────────┬──────────┬──────────┬───────┬─────────┐
│           │    total │ se_total │  mean │ se_mean │
├───────────┼──────────┼──────────┼───────┼─────────┤
│   cv_elpd │ -1551.04 │    13.81 │ -1.62 │    0.01 │
│ naive_lpd │ -1548.62 │    13.75 │ -1.61 │    0.01 │
│     p_eff │     2.42 │     0.08 │  0.00 │    0.00 │
└───────────┴──────────┴──────────┴───────┴─────────┘

I would expect these two approaches to yield the same result.

Am I simply using psis_loo() incorrectly when passing it a Turing model of type DynamicPPL.Model? The message `Results of PSIS-LOO-CV with 10000 Monte Carlo samples and 1 data points` seems to suggest that it does not recognize that the model contains multiple data points.

For now I will continue to use the ll_fun approach, but wanted to call this out here in case there is something that needs a closer look in TuringHelpers.jl.

ParadaCarleton commented 2 years ago

The problem is that the Turing model has an error: it uses MvNormal to represent a sequence of independent, normally distributed values.

However, MvNormal is meant for the case where each observation is a vector of correlated values. For example, if I want to predict each person's height and weight (which are correlated), I could say [height, weight] ~ MvNormal([6 ft, 150 lbs], Σ). As written, your model says "I have a single observation consisting of one really long vector of uncorrelated features." When LOO leaves that single observation out, the model breaks, because it is being trained on a sample of size zero while trying to predict the whole dataset.
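As a minimal illustration of the difference (plain Distributions.jl rather than the ParetoSmooth API; the toy vectors below are made up for the example), the MvNormal formulation yields a single joint log density for the whole vector, while independent Normals yield one log-density term per data point, which is what LOO needs in order to leave one observation out at a time:

using Distributions, LinearAlgebra

μ = [0.1, -0.2, 0.3]    # toy fitted means
σ = 0.5                 # toy residual standard deviation
h = [0.0, 0.1, 0.2]     # toy observations

# MvNormal: one scalar log density for the entire vector -> "1 data point"
joint = logpdf(MvNormal(μ, σ^2 * I), h)

# Independent Normals: one log-density term per observation -> the pointwise
# terms that PSIS-LOO can reweight after dropping each observation in turn
pointwise = logpdf.(Normal.(μ, σ), h)

joint ≈ sum(pointwise)  # true: same total likelihood, different granularity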

Rewriting the Turing model like this should fix the problem:

@model function adult_happiness(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    μ = α .+ β .* a
    @. h ~ Normal(μ, σ)
end
ParadaCarleton commented 2 years ago

(Don't worry, this is a very common mistake -- common enough that I should probably add an explanation when the model has a sample size of 1 :smile:)

nsgrantham commented 2 years ago

Thanks for the quick response and the fix, @ParadaCarleton. I understand the source of the problem now. This is my first time encountering the @. macro; it's very useful.

I'm not sure I would say that the model has an error, however. Using the MvNormal distribution with a mean vector and a scalar standard deviation as shorthand for a vector of independent Normal distributions is a convention that Turing's own documentation uses (see the Linear Regression tutorial). I also recall seeing somewhere (perhaps in an issue that I can't seem to find again) that using MvNormal in this way, rather than `for i in 1:N; h[i] ~ Normal(...); end`, is better for the sampler from a computational perspective.

In any case, I can see how using MvNormal in this way would confuse psis_loo() because it isn't clear from the code alone whether this is a single observation or multiple independent observations.

nsgrantham commented 2 years ago

And yes, changing `h ~ MvNormal(μ, σ)` to `@. h ~ Normal(μ, σ)` resolves the issue.
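For completeness, a sketch of the corrected end-to-end call, reusing the adults data frame from the example above (the names adult_happiness_fixed, model_fixed, and post_fixed are made up for this sketch):

@model function adult_happiness_fixed(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    μ = α .+ β .* a
    @. h ~ Normal(μ, σ)  # pointwise likelihood: one term per adult
end

model_fixed = adult_happiness_fixed(adults.happiness, adults.age)
post_fixed = sample(model_fixed, NUTS(), 10_000)

# With the pointwise likelihood, this should now report one data point per
# row of adults (960 in the run above) instead of 1.
psis_loo(model_fixed, post_fixed)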

ParadaCarleton commented 2 years ago

I'm not sure I would say that the model has an error, however. Using the MvNormal distribution with a mean vector and a scalar standard deviation as shorthand for a vector of independent Normal distributions is a convention that Turing's own documentation uses (see the Linear Regression tutorial).

I believe @torfjelde commented on this by saying:

IMO it does go against the semantic meaning of MvNormal: it represents a single multivariate random variable, not a collection of such. Hence if you want a collection of independent observations, then you should use `.~`. Note that both the LHS and RHS of `.~` can be arrays, e.g. `[1.0, 2.0] .~ [Normal(), Normal()]` also works.

I also recall seeing somewhere (perhaps in an issue that I can't seem to find again) that using MvNormal in this way, rather than `for i in 1:N; h[i] ~ Normal(...); end`, is better for the sampler from a computational perspective.

Broadcasting (`.~`) is definitely faster than using a loop, but I think that broadcasting should be just as fast as using MvNormal. Don't quote me on that though.
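For reference, a sketch of the three formulations being discussed; all three define the same posterior, but only the loop and broadcast forms expose the pointwise likelihoods that psis_loo needs. The model name is made up for the comparison, and the MvNormal variant assumes LinearAlgebra is loaded for I:

using Turing, LinearAlgebra

@model function adult_happiness_variants(h, a)
    α ~ Normal(0, 1)
    β ~ Normal(0, 2)
    σ ~ Exponential(1)
    μ = α .+ β .* a

    # Option 1: explicit loop -- pointwise (one likelihood term per observation);
    # per the discussion above, typically slower than broadcasting
    for i in eachindex(h)
        h[i] ~ Normal(μ[i], σ)
    end

    # Option 2: broadcasting -- also pointwise, and faster than the loop
    # @. h ~ Normal(μ, σ)

    # Option 3: MvNormal -- a single multivariate observation; fine for sampling,
    # but it collapses everything into one data point for LOO
    # h ~ MvNormal(μ, σ^2 * I)
end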

That being said, I'm working on a PR that will fix this by creating syntax that lets users unambiguously specify what they mean by an independent observation.