StatisticalRethinkingJulia / StatisticalRethinking.jl

Julia package with selected functions in the R package `rethinking`. Used in the SR2... projects.
MIT License
385 stars 32 forks source link

Function precis() is not robust to missing values #119

Closed Shmuma closed 3 years ago

Shmuma commented 3 years ago

Hi!

I found a problem with precis function, which crashes if dataframe contains missing values. In section 4.5 of the book, "cherry blossoms" dataset is being analized, where 50% of cases contain missing values.

How to reproduce: You can take the data here: https://github.com/Shmuma/rethinking-2ed-julia/blob/main/data/cherry_blossoms.csv

using CSV
using DataFrames
using StatisticalRethinking

d = DataFrame(CSV.File("cherry_blossoms.csv", missingstring="NA"))
println(first(d, 5))
precis(d)

The result is

10×5 DataFrame
 Row │ year   doy      temp      temp_upper  temp_lower 
     │ Int64  Int64?   Float64?  Float64?    Float64?   
─────┼──────────────────────────────────────────────────
   1 │   801  missing   missing     missing     missing 
   2 │   802  missing   missing     missing     missing 
   3 │   803  missing   missing     missing     missing 
   4 │   804  missing   missing     missing     missing 
   5 │   805  missing   missing     missing     missing 
   6 │   806  missing   missing     missing     missing 
   7 │   807  missing   missing     missing     missing 
   8 │   808  missing   missing     missing     missing 
   9 │   809  missing   missing     missing     missing 
  10 │   810  missing   missing     missing     missing 
ArgumentError: quantiles are undefined in presence of missing values

Stacktrace:
  [1] _quantilesort!(v::Vector{Union{Missing, Int64}}, sorted::Bool, minp::Float64, maxp::Float64)
    @ Statistics /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Statistics/src/Statistics.jl:959
  [2] #quantile!#52
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Statistics/src/Statistics.jl:943 [inlined]
  [3] quantile(v::Vector{Union{Missing, Int64}}, p::Float64; sorted::Bool, alpha::Float64, beta::Float64)
    @ Statistics /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Statistics/src/Statistics.jl:1052
  [4] quantile(v::Vector{Union{Missing, Int64}}, p::Float64)
    @ Statistics /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Statistics/src/Statistics.jl:1052
  [5] _broadcast_getindex_evalf
    @ ./broadcast.jl:648 [inlined]
  [6] _broadcast_getindex
    @ ./broadcast.jl:621 [inlined]
  [7] getindex
    @ ./broadcast.jl:575 [inlined]
  [8] copyto_nonleaf!(dest::Vector{Float64}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(quantile), Tuple{Base.Broadcast.Extruded{Vector{AbstractVector{T} where T}, Tuple{Bool}, Tuple{Int64}}, Float64}}, iter::Base.OneTo{Int64}, state::Int64, count::Int64)
    @ Base.Broadcast ./broadcast.jl:1078
  [9] copy
    @ ./broadcast.jl:930 [inlined]
 [10] materialize
    @ ./broadcast.jl:883 [inlined]
 [11] precis(df::DataFrame; io::IJulia.IJuliaStdio{Base.PipeEndpoint}, digits::Int64, depth::Float64, alpha::Float64)
    @ StatisticalRethinking ~/.julia/packages/StatisticalRethinking/VpDQW/src/precis.jl:26
 [12] precis(df::DataFrame)
    @ StatisticalRethinking ~/.julia/packages/StatisticalRethinking/VpDQW/src/precis.jl:22
 [13] top-level scope
    @ In[23]:3
 [14] eval
    @ ./boot.jl:360 [inlined]
 [15] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1116

If I drop rows with missing values, only 787 of 1215 cases will be left. Precis will work, but produce result different from the book

dd = disallowmissing(d[completecases(d),:]);
println(size(d), size(dd))
precis(dd)

The output is

(1215, 5)(787, 5)
┌────────────┬──────────────────────────────────────────────────────────┐
│      param │    mean      std     5.5%     50%    94.5%     histogram │
├────────────┼──────────────────────────────────────────────────────────┤
│       year │  1533.4  291.123  1008.61  1563.0  1935.77  ▁▃▃▅▅▅█████▇ │
│        doy │ 104.921   6.2577     95.0   105.0    115.0      ▁▁▅██▅▂▁ │
│       temp │  6.1004   0.6834      5.1    6.06   7.3662      ▁▅▆█▄▂▁▁ │
│ temp_upper │  6.9376    0.812     5.81     6.8   8.4177      ▂█▆▂▁▁▁▁ │
│ temp_lower │  5.2635   0.7622   4.1923    5.25   6.6277   ▁▁▁▃▆█▅▂▁▁▁ │
└────────────┴──────────────────────────────────────────────────────────┘
Shmuma commented 3 years ago

New version produces this output, which is the same as in the book

┌────────────┬─────────────────────────────────────────────────────────┐
│      param │   mean      std    5.5%     50%    94.5%      histogram │
├────────────┼─────────────────────────────────────────────────────────┤
│       year │ 1408.0  350.885  867.77  1408.0  1948.23  ████████████▂ │
│        doy │ 104.54    6.407   94.43   105.0    115.0       ▁▂▆██▅▂▁ │
│       temp │ 6.1419   0.6636    5.15     6.1   7.2947       ▁▄▆█▄▂▁▁ │
│ temp_upper │ 7.1852   0.9929  5.8977    7.04   8.9023       ▂██▃▁▁▁▁ │
│ temp_lower │ 5.0989   0.8503  3.7876   5.145     6.37       ▁▁▁▂▇█▂▁ │
└────────────┴─────────────────────────────────────────────────────────┘