IBM / yaps

A surface language for programming Stan models using python syntax
https://yaps.readthedocs.io
Apache License 2.0
46 stars 9 forks source link

using yaps to access stansummary command #3

Open sreedta8 opened 3 years ago

sreedta8 commented 3 years ago

Hi

I learned about yaps for the first time today and I could install yaps (with its dependencies) and cmdstan and successfully run the example model of the coin toss. In addition to examining the parameter theta, I want to execute the stansummary command on the output.csv created by cmdstan which summarizes all the parameters. Please see the desired output I get from cmdstan

(base) sz-HP15-Nb:~/cmdstan$ bin/stansummary output.csv
Input file: output.csv
Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took 0.0070 seconds
Sampling took 0.013 seconds

                Mean     MCSE   StdDev     5%   50%   95%    N_Eff  N_Eff/s    R_hat

lp__            -7.3  4.6e-02     0.72   -8.7  -7.0  -6.8      243    18684      1.0
accept_stat__   0.94  2.6e-03  8.3e-02   0.77  0.98   1.0  1.0e+03  7.7e+04  1.0e+00
stepsize__      0.86      nan  4.3e-15   0.86  0.86  0.86      nan      nan      nan
treedepth__      1.5  1.9e-02  6.0e-01    1.0   1.0   3.0  9.9e+02  7.6e+04  1.0e+00
n_leapfrog__     3.2  6.1e-02  1.9e+00    1.0   3.0   7.0  1.0e+03  7.8e+04  1.0e+00
divergent__     0.00      nan  0.0e+00   0.00  0.00  0.00      nan      nan      nan
energy__         7.8  6.2e-02  1.0e+00    6.8   7.5   9.7  2.6e+02  2.0e+04  1.0e+00

theta           0.24  5.6e-03     0.12  0.074  0.22  0.45      434    33358      1.0

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).
mandel commented 3 years ago

Thank you for your interest in the project.

I think that you could directly use pycmdstan to execute this command from Python (https://pycmdstan.readthedocs.io/en/latest/api.html#pycmdstan.model.stansummary_csv). Would it work for you?

gbdrt commented 3 years ago

Hi,

See #4 for accessing the pycmdstan original API. Then to extract the summary you can, e.g., use pandas. Looking at the shape of the pycmdstan summary, maybe something like:

import pandas as pd
import pycmdstan

cmdstan_model = pycmdstan.Model(code=str(constrained_coin.model))
run = cmdstan_model.sample(data=constrained_coin.data)
run_set = pycmdstan.model.RunSet(run)

pd.DataFrame({k:list(v[0]) for k,v in run_set.summary.items()}, index=run_set.summary['lp__'].dtype.fields).transpose()

Output:

Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -7.26406 0.0398246 0.706619 -8.8589 -6.99614 -6.75105 314.824 5247.06 1.00213
accept_stat__ 0.920701 0.00386555 0.122791 0.669269 0.972133 1 1009.04 16817.4 1.00397
stepsize__ 0.943827 nan 2.55479e-15 0.943827 0.943827 0.943827 nan nan nan
treedepth__ 1.343 0.0178697 0.474949 1 1 2 706.417 11773.6 1.00135
n_leapfrog__ 2.39 0.037044 0.99644 1 3 3 723.546 12059.1 1.00239
divergent__ 0 nan 0 0 0 0 nan nan nan
energy__ 7.74784 0.0576719 0.980427 6.79351 7.45342 9.80426 289.003 4816.72 1.0046
theta 0.252494 0.00762163 0.120613 0.0840843 0.231581 0.476417 250.432 4173.86 1.01409
sreedta8 commented 3 years ago

@mandel @gbdrt thank you both for your responses clarifying further and with examples. I'm coming from an applied statistical domain of modeling, segmentation, and quasi-experimentation with many years of experience. I'm an intermediary level programmer in Python, learning by example. I have been using rstan, pystan and pymc3 quite a bit and in one of those posts I read about yaps which is really exciting as I do find Python to be a lot more approachable than C++ or Java.

Your examples are very helpful as I will look to develop some Bayesian modeling examples (Linear Mixed Effects, GLMs, GLMMs,and GAMs and share them back here with you so that others can use these as starters.

sreedta8 commented 3 years ago

@gbdrt I have tried to run the model you shared and I get an error about lib.cmdstan_path("/path/to"). Please see below (the last two lines)

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/site-packages/pycmdstan/model.py", line 311, in _build_summary
    self.niter, self._summary = stansummary_csv(run_csvs)
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/site-packages/pycmdstan/model.py", line 74, in stansummary_csv
    exe = os.path.join(_find_cmdstan(), 'bin', 'stansummary')
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/site-packages/pycmdstan/model.py", line 26, in _find_cmdstan
    'please provide CmdStan path, e.g. lib.cmdstan_path("/path/to/")')
pycmdstan.model.CmdStanNotFound: please provide CmdStan path, e.g. lib.cmdstan_path("/path/to/")

I have already defined a CMDSTAN variable set to cmdstan location CMDSTAN="~/cmdstan/" and can run the same model directly from cmdstan.

mandel commented 3 years ago

What is your OS? How do you launch the example?

Instead of doing simply CMDSTAN="~/cmdstan/", could you try to do

export CMDSTAN=$HOME/cmdstan
sreedta8 commented 3 years ago

I'm working with Ubuntu 20.04LTS. I did it both ways. At the terminal, I used export CMDSTAN=$HOME/cmdstan and in the notebook I used CMDSTAN="~/cmdstan/"

Here is my entire code:

(yapyst) sz@sz-HP15-Nb:~$ export CMDSTAN=$HOME/cmdstan
(yapyst) sz@sz-HP15-Nb:~$ echo $CMDSTAN
/home/sz/cmdstan
import yaps
import pandas as pd
import pycmdstan
import numpy as np
from yaps.lib import int, real, uniform, bernoulli
from matplotlib import pyplot as plt
import logging
logging.getLogger('pycmdstan.model').setLevel(logging.CRITICAL)

@yaps.model
def coin(x: int(lower=0, upper=1)[10]):
    theta: real(lower=0, upper=1) <~ uniform(0, 1)
    for i in range(1,11):
        x[i] <~ bernoulli(theta)

flips = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1])
constrained_coin = coin(x=flips)
#constrained_coin.sample(data=constrained_coin.data)
cmdstan_model = pycmdstan.Model(code=str(constrained_coin.model))
run = cmdstan_model.sample(data=constrained_coin.data)
run_set = pycmdstan.model.RunSet(run)

Error Output is as follows:

INFO:filelock:Lock 139923306816656 acquired on /home/sreezach/.cache/pycmdstan/model-796d8029.lock
INFO:filelock:Lock 139923306816656 released on /home/sreezach/.cache/pycmdstan/model-796d8029.lock
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/site-packages/pycmdstan/model.py", line 311, in _build_summary
    self.niter, self._summary = stansummary_csv(run_csvs)
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/site-packages/pycmdstan/model.py", line 74, in stansummary_csv
    exe = os.path.join(_find_cmdstan(), 'bin', 'stansummary')
  File "/home/sz/miniconda3/envs/yapyst/lib/python3.7/site-packages/pycmdstan/model.py", line 26, in _find_cmdstan
    'please provide CmdStan path, e.g. lib.cmdstan_path("/path/to/")')
pycmdstan.model.CmdStanNotFound: please provide CmdStan path, e.g. lib.cmdstan_path("/path/to/")
mandel commented 3 years ago

@sreedta8 Does the command $CMDSTAN/bin/stanc --version works for you?

mandel commented 3 years ago

Maybe you could also try to add $CMDSTAN/bin to your path:

export PATH=$CMDSTAN/bin:$PATH
sreedta8 commented 3 years ago

@mandel @gbdrt

I solved the error with the following changes:

At the beginning (where the import statements are) I added

import os

then later I added the following statements before executing the model sampling

os.environ['CMDSTAN'] = "/home/sz/cmdstan"
path = os.environ.get('CMDSTAN', 'cmdstan')
flips = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1])
constrained_coin = coin(x=flips)
#constrained_coin.sample(data=constrained_coin.data)
cmdstan_model = pycmdstan.Model(code=str(constrained_coin.model))
run = cmdstan_model.sample(data=constrained_coin.data)
run_set = pycmdstan.model.RunSet(run)

Output is as follows without errors:

INFO:filelock:Lock 140521467452752 acquired on /home/sreezach/.cache/pycmdstan/model-796d8029.lock
INFO:filelock:Lock 140521467452752 released on /home/sreezach/.cache/pycmdstan/model-796d8029.lock

Then I used pandas to summarize (using code provided by @gbdrt): pd.DataFrame({k:list(v[0]) for k,v in run_set.summary.items()}, index=run_set.summary['lp__'].dtype.fields).transpose()

with this tabular output

  | Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
-7.346210 | 0.031585 | 0.795877 | -8.830190 | -7.061580 | -6.753910 | 634.943 | 28861.0 | 1.004300
0.883828 | 0.005207 | 0.171573 | 0.480840 | 0.960950 | 1.000000 | 1085.830 | 49356.1 | 0.999064
1.260740 | NaN | 0.000000 | 1.260740 | 1.260740 | 1.260740 | NaN | NaN | NaN
1.447000 | 0.016882 | 0.497432 | 1.000000 | 1.000000 | 2.000000 | 868.212 | 39464.2 | 0.999198
2.214000 | 0.034800 | 0.989537 | 1.000000 | 3.000000 | 3.000000 | 808.527 | 36751.2 | 0.999004
0.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN
7.839050 | 0.045891 | 1.036690 | 6.812870 | 7.571700 | 9.768000 | 510.325 | 23196.6 | 1.003260
0.259071 | 0.006663 | 0.129156 | 0.074004 | 0.243164 | 0.485656 | 375.747 | 17079.4 | 0.999056
mandel commented 3 years ago

Thank you @sreedta8 for finding a solution!

sreedta8 commented 3 years ago

@mandel thank you for taking the time to help with your feedback.

gbdrt commented 3 years ago

@sreedta8 that is a bit strange. Is it possible that you start the jupyter server before the export CMDSTAN=...?

Jupyter should inherit local env variables (I cannot reproduce your error). That being said using os.environ in the notebook, is probably more robust.

sreedta8 commented 3 years ago

@gbdrt you surmised correctly. I did do that. I got the code about using the os.environ from another Stan development called CmdStanPy (a python wrapper for running CmdStan https://github.com/stan-dev/cmdstanpy )

Thanks again for your help - Sree

gbdrt commented 3 years ago

Yes CmdStanPy is great! Unfortunately, this interface requires Stan files (you cannot pass the model as a string). But, as with pycmdstan, you can also use it with YAPS by manually generating the files.

from cmdstanpy import CmdStanModel

with open("coin.stan", "w") as model_file:
    model_file.write(str(constrained_coin.model))

coin_model = CmdStanModel(stan_file="coin.stan")
coin_fit = coin_model.sample(data=constrained_coin.data)
print(coin_fit.summary())

Output:

name Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -7.27994 0.0229256 0.781747 -8.86246 -6.9734 -6.75011 1162.76 3125.7 1.00471
theta 0.246195 0.00352571 0.117963 0.0750031 0.234836 0.464874 1119.43 3009.23 1.0033

If you find this useful, we could update the YAPS wrapper to use CmdStanPy. (One issue though, is that the installation process is not as easy as with pycmdstan).