cosname / cosx.org

统计之都主站
https://cosx.org
MIT License
265 stars 239 forks source link

概率编程语言 Stan #830

Closed XiangyunHuang closed 5 years ago

XiangyunHuang commented 5 years ago

官网简介

Stan is a state-of-the-art platform for statistical modeling and high-performance statistical computation. Thousands of users rely on Stan for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.

Users specify log density functions in Stan’s probabilistic programming language and get:


写作动机: Stan 发布 2.19.0 继 支持 MPI 后终于开始 支持 GPU 关注其发展一年多了,终于到了该下笔介绍的时候了

Stan 的接口有很多,如 rstanpystancmdstan 等共计9种,由于对 cmdstan 功能支持最为全面,因此文章会介绍这个接口的使用,而且学完这个, rstan 就比较容易了

作为入门篇,首先介绍单机 CPU 版 ,此时写作环境如下

  1. cmdstan 2.19.0 包含 boost 1.69.0 eigen 3.3.3 sundials 4.1.0
  2. 虚拟机平台 Fedora 29 Server 64位 GCC/G++ 8.3.1
  3. 内存 5G CPU 单核双线程 i7-4710MQ 主频 2.5 GHz

分以下部分介绍

  1. Stan 的前世今生
  2. 如何安装使用 stan,默认安装和自定义安装
  3. 以简单的二项模型为例,贝叶斯估计之HMC 采样算法与 MLE 估计比较
  4. HMC 算法各个参数的解释,一并介绍理论
  5. 含有超参数的复杂空间模型之 Stan 实现
XiangyunHuang commented 5 years ago

安装

  1. 下载软件压缩包

    curl -fLo ./cmdstan-2.19.tar.gz https://github.com/stan-dev/cmdstan/releases/download/v2.19.0/cmdstan-2.19.0.tar.gz
  2. 解压

    tar -xzf cmdstan-2.19.0.tar.gz
    
    cd cmdstan-2.19.0
    
    tree -L 2 .
    .
    ├── bin
    │   ├── cmdstan
    │   ├── diagnose
    │   ├── print
    │   ├── stanc
    │   └── stansummary
    ├── examples
    │   └── bernoulli
    ├── Jenkinsfile
    ├── LICENSE
    ├── make
    │   ├── command
    │   ├── program
    │   ├── stanc
    │   └── tests
    ├── makefile
    ├── README.md
    ├── runCmdStanTests.py
    ├── src
    │   ├── cmdstan
    │   ├── docs
    │   └── test
    ├── stan
    │   ├── Jenkinsfile
    │   ├── lib
    │   ├── LICENSE.md
    │   ├── licenses
    │   ├── make
    │   ├── makefile
    │   ├── README.md
    │   ├── RELEASE-NOTES.txt
    │   ├── runTests.py
    │   └── src
    └── test-all.sh
    
    14 directories, 20 files
  3. 编译

    cd cmdstan-2.19.0
    make build

    最后显示

    --- CmdStan v2.19.0 built ---

    表示编译成功,中间没有任何警告和错误

测试

伯努利分布模型: Y 服从 0-1 分布,观测数据共10个,现在需要估计其参数 theta

├── bernoulli  编译后生成的可执行文件
├── bernoulli.d
├── bernoulli.data.json 数据文件 JSON 格式
├── bernoulli.data.R 数据文件 R 代码
├── bernoulli.hpp   编译后生成的头文件
└── bernoulli.stan  模型文件
cat examples/bernoulli/bernoulli.data.json
{
    "N" : 10,
    "y" : [0,1,0,0,0,0,0,0,0,1]
}
cat bernoulli.data.R
N <- 10
y <- c(0,1,0,0,0,0,0,0,0,1)
cat examples/bernoulli/bernoulli.stan
# 定义数据
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
# 定义参数
parameters {
  real<lower=0,upper=1> theta;
}
# 定义模型结构
model {
  theta ~ beta(1,1);
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}

bernoulli.d 文件内容

head -n 10 bernoulli.d
examples/bernoulli/bernoulli.o examples/bernoulli/bernoulli: \
 examples/bernoulli/bernoulli.hpp examples/bernoulli/bernoulli.hpp \
 stan/src/stan/model/model_header.hpp stan/lib/stan_math/stan/math.hpp \
 stan/lib/stan_math/stan/math/rev/mat.hpp \
 stan/lib/stan_math/stan/math/rev/core.hpp \
 stan/lib/stan_math/stan/math/rev/core/autodiffstackstorage.hpp \
 stan/lib/stan_math/stan/math/memory/stack_alloc.hpp \
 stan/lib/stan_math/stan/math/prim/scal/meta/likely.hpp \
 stan/lib/stan_math/stan/math/rev/core/build_vari_array.hpp \
 stan/lib/stan_math/stan/math/prim/mat/fun/Eigen.hpp \
...
# bernoulli.stan 文件翻译成 C++ 文件,然后编译成可执行文件 bernoulli
make examples/bernoulli/bernoulli
# 采样结果 output.csv 存放在当前目录下
examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R
method = sample (Default)
  sample
    num_samples = 1000 (Default) # 采样 1000 个
    num_warmup = 1000 (Default) # 预处理 1000 个
    save_warmup = 0 (Default)
    thin = 1 (Default) # 采样间隔 1
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)  # HMC 算法参数
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = examples/bernoulli/bernoulli.data.R
init = 2 (Default)
random
  seed = 53280552 # 随机数种子,用来下次重复结果
output
  file = output.csv (Default)
  diagnostic_file =  (Default)
  refresh = 100 (Default)

Gradient evaluation took 1.2e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Adjust your expectations accordingly!

Iteration:    1 / 2000 [  0%]  (Warmup)
Iteration:  100 / 2000 [  5%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  300 / 2000 [ 15%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  500 / 2000 [ 25%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  700 / 2000 [ 35%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration:  900 / 2000 [ 45%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1100 / 2000 [ 55%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1300 / 2000 [ 65%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1500 / 2000 [ 75%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1700 / 2000 [ 85%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 1900 / 2000 [ 95%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.009446 seconds (Warm-up)
               0.015629 seconds (Sampling)
               0.025075 seconds (Total)
# output.csv 是采样过程中收集的原始数据
bin/stansummary output.csv
Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (0.0094) seconds, 0.0094 seconds total
Sampling took (0.016) seconds, 0.016 seconds total

                Mean     MCSE   StdDev     5%   50%   95%    N_Eff  N_Eff/s    R_hat
lp__            -7.4  7.1e-02  9.5e-01   -9.2  -7.0  -6.8  1.8e+02  1.1e+04  1.0e+00
accept_stat__   0.93  3.4e-03  1.1e-01   0.69  0.98   1.0  1.0e+03  6.6e+04  1.0e+00
stepsize__       1.0  4.7e-15  3.3e-15    1.0   1.0   1.0  5.0e-01  3.2e+01  1.0e+00
treedepth__      1.4  1.5e-02  4.9e-01    1.0   1.0   2.0  1.0e+03  6.5e+04  1.0e+00
n_leapfrog__     2.5  4.9e-02  1.4e+00    1.0   3.0   7.0  8.0e+02  5.1e+04  1.0e+00
divergent__     0.00     -nan  0.0e+00   0.00  0.00  0.00     -nan     -nan     -nan
energy__         7.9  8.4e-02  1.2e+00    6.8   7.5    10  2.0e+02  1.3e+04  1.0e+00
theta           0.25  7.0e-03  1.3e-01  0.063  0.23  0.48  3.3e+02  2.1e+04  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at
convergence, R_hat=1).

与频率派的最大似然估计的比较

参考资料

  1. Getting Started with CmdStan
  2. rstudio-1-2-preview-stan
  3. 空间广义线性混合效应模型及其在 Geostatistics 中的应用