StatisticalRethinkingJulia / StatisticalRethinking.jl

Julia package with selected functions in the R package `rethinking`. Used in the SR2... projects.
MIT License
385 stars 32 forks source link

Problem with Chains to DataFrame conversion #118

Closed Shmuma closed 2 years ago

Shmuma commented 3 years ago

Hello!

I found the issue with dataframe conversion method defined in df.jl. Resulting dataframe might have columns in the wrong order in comparison to the original chain.

How to reproduce:

using Turing
using Distributions
using StatisticalRethinking

@model function model(y)
    μ ~ Normal(5, 1)
    σ ~ Uniform(0, 1)
    b ~ Uniform(0, 1)
    y ~ Normal(μ, σ) + b
end

chain = sample(model(10), NUTS(), 1000)
display(describe(ch2)[1]);

first(DataFrame(ch2), 10)

The output from the chain is following

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   e ⋯
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64     ⋯

           μ    7.5558    0.7538     0.0238    0.0299   449.4640    1.0019     ⋯
           σ    0.8266    0.1614     0.0051    0.0095   335.4502    1.0006     ⋯
           b    0.6971    0.2339     0.0074    0.0113   549.3266    1.0028     ⋯

But the resulting dataframe is image

So, parameters b and μ are swapped.

Without importing StatisticalRethinking package, everything works as expected. Why do we need this conversion at all? Turing provides Table.jl interface, which does the conversion without any extra methods.

goedman commented 3 years ago

Thank you, I’ll have a look later today.

Your final remark is correct, now MCMCCahins.Chains support the Tables.jl interface this is no longer necessary (and here even produces the wrong result).

Rob J Goedman @.***

On Aug 12, 2021, at 02:20, Max Lapan @.***> wrote:

Hello!

I found the issue with dataframe conversion method defined in df.jl. Resulting dataframe might have columns in the wrong order in comparison to the original chain.

How to reproduce:

using Turing using Distributions using StatisticalRethinking

@model function model(y) μ ~ Normal(5, 1) σ ~ Uniform(0, 1) b ~ Uniform(0, 1) y ~ Normal(μ, σ) + b end

chain = sample(model(10), NUTS(), 1000) display(describe(ch2)[1]);

first(DataFrame(ch2), 10) The output from the chain is following

Summary Statistics parameters mean std naive_se mcse ess rhat e ⋯ Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯

       μ    7.5558    0.7538     0.0238    0.0299   449.4640    1.0019     ⋯
       σ    0.8266    0.1614     0.0051    0.0095   335.4502    1.0006     ⋯
       b    0.6971    0.2339     0.0074    0.0113   549.3266    1.0028     ⋯

But the resulting dataframe is https://user-images.githubusercontent.com/66154/129163035-2ed0f7a5-f79a-495d-aa1e-3a352c2f48c7.png So, parameters b and μ are swapped.

Without importing StatisticalRethinking package, everything works as expected. Why do we need this conversion at all? Turing provides Table.jl interface, which does the conversion without any extra methods.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/StatisticalRethinkingJulia/StatisticalRethinking.jl/issues/118, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMX2D2O5FBEW6ECPY2ITTT4N74PANCNFSM5CARWZGQ.

goedman commented 3 years ago

Hi Max,

Thanks for reporting this!

The last few days there have been several of these messages on the MCMCChains lists, e.g. .

This must have been a recent change, probably in AxisArrays.jl but I’m not sure about that.

Anyway, I just pushed StatisticalRethinking v3.4.5 with a fix that, with your (slightly modified) script:

using Turing
using Distributions
using StatisticalRethinking

@model function model(y)
    μ ~ Normal(5, 1)
    σ ~ Uniform(0, 1)
    b ~ Uniform(0, 1)
    y ~ Normal(μ, σ) + b
end

chain = sample(model(10), NUTS(), 1000)
chain |> display

describe(chain)[1] |> display

df = DataFrame(chain);

first(df, 5) |> display
println()

mean(Array(df); dims=1) |> display

returns:

Chains MCMC chain (1000×15×1 Array{Float64, 3}):

Iterations        = 501:1:1500
Number of chains  = 1
Samples per chain = 1000
Wall duration     = 2.04 seconds
Compute duration  = 2.04 seconds
parameters        = b, μ, σ
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   ess_per_sec 
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64       Float64 

           μ    7.5998    0.8290     0.0262    0.0531   287.5704    0.9997      140.6900
           σ    0.8078    0.1900     0.0060    0.0169   135.2522    1.0028       66.1704
           b    0.6979    0.2495     0.0079    0.0127   384.3919    0.9994      188.0587

...

5×3 DataFrame
 Row │ μ        σ         b        
     │ Float64  Float64   Float64  
─────┼─────────────────────────────
   1 │ 7.38799  0.648949  0.540874
   2 │ 7.89265  0.676215  0.698801
   3 │ 7.18477  0.955431  0.866246
   4 │ 6.34593  0.973954  0.746923
   5 │ 6.38664  0.915485  0.67176

1×3 Matrix{Float64}:
 7.59975  0.807811  0.697935
goedman commented 3 years ago

On your last point, if I comment out using StatisticalRethinking and restart Julia, I get:

Chains MCMC chain (1000×15×1 Array{Float64, 3}):

Iterations        = 501:1:1500
Number of chains  = 1
Samples per chain = 1000
Wall duration     = 10.04 seconds
Compute duration  = 10.04 seconds
parameters        = b, μ, σ
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   ess_per_sec 
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64       Float64 

           μ    7.6713    0.7585     0.0240    0.0359   410.9653    0.9997       40.9491
           σ    0.7941    0.1679     0.0053    0.0123   243.6631    1.0020       24.2789
           b    0.7039    0.2333     0.0074    0.0075   764.9915    0.9991       76.2247

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

           μ    6.0289    7.2054    7.6569    8.2029    8.9793
           σ    0.3788    0.6954    0.8396    0.9285    0.9961
           b    0.1468    0.5481    0.7664    0.8898    0.9899

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   ess_per_sec 
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64       Float64 

           μ    7.6713    0.7585     0.0240    0.0359   410.9653    0.9997       40.9491
           σ    0.7941    0.1679     0.0053    0.0123   243.6631    1.0020       24.2789
           b    0.7039    0.2333     0.0074    0.0075   764.9915    0.9991       76.2247

ERROR: LoadError: UndefVarError: DataFrame not defined
Stacktrace:
 [1] top-level scope
   @ ~/.julia/dev/StatisticalRethinkingTuring/test/test_dataframe.jl:30
 [2] include(fname::String)
   @ Base.MainInclude ./client.jl:444
 [3] top-level scope
   @ REPL[1]:1
in expression starting at /Users/rob/.julia/dev/StatisticalRethinkingTuring/test/test_dataframe.jl:30

julia> convert(DataFrame, chain)
ERROR: UndefVarError: DataFrame not defined
Stacktrace:
 [1] top-level scope
   @ REPL[1]:1

I indeed had expected a DataFrame conversion would be implemented.

What does work is:

julia> cdf = describe(chain)
2-element Vector{ChainDataFrame}:
 Summary Statistics (3 x 8)
 Quantiles (3 x 6)

julia> DataFrame(cdf[1])
3×8 DataFrame
 Row │ parameters  mean      std       naive_se    mcse        ess      rhat      ess_per_sec 
     │ Symbol      Float64   Float64   Float64     Float64     Float64  Float64   Float64     
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ μ           7.67127   0.758478  0.0239852   0.0358895   410.965  0.999715      40.9491
   2 │ σ           0.794146  0.167938  0.00531066  0.012331    243.663  1.002         24.2789
   3 │ b           0.703884  0.23332   0.00737823  0.00754575  764.991  0.999103      76.2247

julia> Tables.columnnames(chain)
(:iteration, :chain, :μ, :σ, :b, :lp, :n_steps, :is_accept, :acceptance_rate, :log_density, :hamiltonian_energy, :hamiltonian_energy_error, :max_hamiltonian_energy_error, :tree_depth, :numerical_error, :step_size, :nom_step_size)

julia> DataFrame(cdf[1])
3×8 DataFrame
 Row │ parameters  mean      std       naive_se    mcse        ess      rhat      ess_per_sec 
     │ Symbol      Float64   Float64   Float64     Float64     Float64  Float64   Float64     
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ μ           7.67127   0.758478  0.0239852   0.0358895   410.965  0.999715      40.9491
   2 │ σ           0.794146  0.167938  0.00531066  0.012331    243.663  1.002         24.2789
   3 │ b           0.703884  0.23332   0.00737823  0.00754575  764.991  0.999103      76.2247

But this is very different.

Shmuma commented 3 years ago

Hi! Thanks for fixing!

After couple of issues with julia registry caching I was able to install the new version. Everything works, thanks!

As a side note: wouldn't it be simpler to implement DataFrame conversion like this:

DataFrame(m::MCMCChains.Chains) = DataFrame(Dict(n => vec(m[n].data) for n in names(m,:parameters)))

It will work with any amount of params and not need explicit conversion to Array

Shmuma commented 3 years ago

On your last point, if I comment out using StatisticalRethinking and restart Julia, I get: ERROR: LoadError: UndefVarError: DataFrame not defined

Ah, this happens due to not imported DataFrames package. There is a working example:

using Turing
using Distributions
using DataFrames

@model function model(y)
    μ ~ Normal(5, 1)
    σ ~ Uniform(0, 1)
    b ~ Uniform(0, 1)
    y ~ Normal(μ, σ) + b
end

chain = sample(model(10), NUTS(), 1000)
chain |> display

describe(chain)[1] |> display

df = DataFrame(chain);

first(df, 5) |> display
println()

mean(Array(df); dims=1) |> display
goedman commented 3 years ago

Thanks, yes you are right, if I comment out SR, DataFrames is no longer loaded.

I also do like your suggestion and need to think about:

DataFrame(m::MCMCChains.Chains, section::Symbol) =
    DataFrame(Dict(n => vec(m[n].data) for n in names(m,section)))

or

DataFrame(m::MCMCChains.Chains, section=:parameters) =
    DataFrame(Dict(n => vec(m[n].data) for n in names(m,section)))
Shmuma commented 3 years ago

Ok, in that case I'll keep the issue open as a reminder for a potential improvement :)