Open alashworth opened 5 years ago
Comment by syclik Tuesday Jul 28, 2015 at 00:34 GMT
@bgoodri, what if we pushed the adaptation parameters before the header line (lp__,accept_stat__,...
)?
@betanalpha, would that offend you?
I think that'll be doable. We could toss that extra line at the end before pushing out the timing info.
Comment by bgoodri Tuesday Jul 28, 2015 at 00:51 GMT
What if someone chooses to save the warmup?
On Mon, Jul 27, 2015, 8:34 PM Daniel Lee notifications@github.com wrote:
@bgoodri https://github.com/bgoodri, what if we pushed the adaptation parameters before the header line (lp__,accept_stat__,...)? @betanalpha https://github.com/betanalpha, would that offend you?
I think that'll be doable. We could toss that extra line at the end before pushing out the timing info.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-125388415.
Comment by syclik Tuesday Jul 28, 2015 at 03:19 GMT
Damn. You're right. We could output to a different file. Move that to diagnostic info.
On Monday, July 27, 2015, bgoodri notifications@github.com wrote:
What if someone chooses to save the warmup?
On Mon, Jul 27, 2015, 8:34 PM Daniel Lee <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:
@bgoodri https://github.com/bgoodri, what if we pushed the adaptation parameters before the header line (lp__,accept_stat__,...)? @betanalpha https://github.com/betanalpha, would that offend you?
I think that'll be doable. We could toss that extra line at the end before pushing out the timing info.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-125388415.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-125396365.
Comment by bgoodri Tuesday Jul 28, 2015 at 03:38 GMT
Fine by me but would need a corresponding change to read_stan_csv in rstan
On Mon, Jul 27, 2015 at 11:19 PM, Daniel Lee notifications@github.com wrote:
Damn. You're right. We could output to a different file. Move that to diagnostic info.
On Monday, July 27, 2015, bgoodri notifications@github.com wrote:
What if someone chooses to save the warmup?
On Mon, Jul 27, 2015, 8:34 PM Daniel Lee <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:
@bgoodri https://github.com/bgoodri, what if we pushed the adaptation parameters before the header line (lp__,accept_stat__,...)? @betanalpha https://github.com/betanalpha, would that offend you?
I think that'll be doable. We could toss that extra line at the end before pushing out the timing info.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-125388415.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-125396365.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-125421450.
Comment by bgoodri Monday Aug 17, 2015 at 22:58 GMT
Moving the adaptation stuff to the diagnostic file would allow me to unify the read_stan_csv() function for the MCMC and VB output.
Comment by betanalpha Monday Aug 17, 2015 at 23:25 GMT
The diagnostic file is meant for HMC diagnostics and to be entirely optional. read_stan_csv is not built for handling MCMC and VB output (let alone optimization output) and we shouldn’t be hacking it all together. We need a valid spec that can handle generic interfaces, and then we can define a Stan output format and a generic output reader. We need to stop hacking and start refactoring.
On Aug 17, 2015, at 3:58 PM, bgoodri notifications@github.com wrote:
Moving the adaptation stuff to the diagnostic file would allow me to unify the read_stan_csv() function for the MCMC and VB output.
— Reply to this email directly or view it on GitHub.
Comment by bgoodri Monday Aug 17, 2015 at 23:48 GMT
What do you mean read_stan_csv is not built for handling MCMC output? It has been handling the MCMC output produced by Stan since 2012. The only difference is that VB outputs different comments, which is annoying to deal with, and the reading is really slow if you specify the comment character as # because it has to read each line one character at a time.
On Mon, Aug 17, 2015 at 7:25 PM, Michael Betancourt < notifications@github.com> wrote:
The diagnostic file is meant for HMC diagnostics and to be entirely optional. read_stan_csv is not built for handling MCMC and VB output (let alone optimization output) and we shouldn’t be hacking it all together. We need a valid spec that can handle generic interfaces, and then we can define a Stan output format and a generic output reader. We need to stop hacking and start refactoring.
On Aug 17, 2015, at 3:58 PM, bgoodri notifications@github.com wrote:
Moving the adaptation stuff to the diagnostic file would allow me to unify the read_stan_csv() function for the MCMC and VB output.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-131991031.
Comment by betanalpha Tuesday Aug 18, 2015 at 00:00 GMT
MCMC and VB/optimization. It was only ever written for MCMC output, everything since has been hacks.
On Aug 17, 2015, at 4:48 PM, bgoodri notifications@github.com wrote:
What do you mean read_stan_csv is not built for handling MCMC output? It has been handling the MCMC output produced by Stan since 2012. The only difference is that VB outputs different comments, which is annoying to deal with, and the reading is really slow if you specify the comment character as # because it has to read each line one character at a time.
On Mon, Aug 17, 2015 at 7:25 PM, Michael Betancourt < notifications@github.com> wrote:
The diagnostic file is meant for HMC diagnostics and to be entirely optional. read_stan_csv is not built for handling MCMC and VB output (let alone optimization output) and we shouldn’t be hacking it all together. We need a valid spec that can handle generic interfaces, and then we can define a Stan output format and a generic output reader. We need to stop hacking and start refactoring.
On Aug 17, 2015, at 3:58 PM, bgoodri notifications@github.com wrote:
Moving the adaptation stuff to the diagnostic file would allow me to unify the read_stan_csv() function for the MCMC and VB output.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-131991031.
— Reply to this email directly or view it on GitHub.
Comment by bgoodri Tuesday Aug 18, 2015 at 00:07 GMT
It reads VB output easier than it reads MCMC output because of the latter's original hack of putting the adaptation info behind a # in the middle of the file when the warmup is saved. If we remove that hack, then it would work just fine.
On Mon, Aug 17, 2015 at 8:00 PM, Michael Betancourt < notifications@github.com> wrote:
MCMC and VB/optimization. It was only ever written for MCMC output, everything since has been hacks.
On Aug 17, 2015, at 4:48 PM, bgoodri notifications@github.com wrote:
What do you mean read_stan_csv is not built for handling MCMC output? It has been handling the MCMC output produced by Stan since 2012. The only difference is that VB outputs different comments, which is annoying to deal with, and the reading is really slow if you specify the comment character as # because it has to read each line one character at a time.
On Mon, Aug 17, 2015 at 7:25 PM, Michael Betancourt < notifications@github.com> wrote:
The diagnostic file is meant for HMC diagnostics and to be entirely optional. read_stan_csv is not built for handling MCMC and VB output (let alone optimization output) and we shouldn’t be hacking it all together. We need a valid spec that can handle generic interfaces, and then we can define a Stan output format and a generic output reader. We need to stop hacking and start refactoring.
On Aug 17, 2015, at 3:58 PM, bgoodri notifications@github.com wrote:
Moving the adaptation stuff to the diagnostic file would allow me to unify the read_stan_csv() function for the MCMC and VB output.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-131991031.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-131998741.
Comment by betanalpha Tuesday Aug 18, 2015 at 00:28 GMT
That’s not a hack — that’s a sequential stream of the MCMC output as it was always designed.
On Aug 17, 2015, at 5:07 PM, bgoodri notifications@github.com wrote:
It reads VB output easier than it reads MCMC output because of the latter's original hack of putting the adaptation info behind a # in the middle of the file when the warmup is saved. If we remove that hack, then it would work just fine.
On Mon, Aug 17, 2015 at 8:00 PM, Michael Betancourt < notifications@github.com> wrote:
MCMC and VB/optimization. It was only ever written for MCMC output, everything since has been hacks.
On Aug 17, 2015, at 4:48 PM, bgoodri notifications@github.com wrote:
What do you mean read_stan_csv is not built for handling MCMC output? It has been handling the MCMC output produced by Stan since 2012. The only difference is that VB outputs different comments, which is annoying to deal with, and the reading is really slow if you specify the comment character as # because it has to read each line one character at a time.
On Mon, Aug 17, 2015 at 7:25 PM, Michael Betancourt < notifications@github.com> wrote:
The diagnostic file is meant for HMC diagnostics and to be entirely optional. read_stan_csv is not built for handling MCMC and VB output (let alone optimization output) and we shouldn’t be hacking it all together. We need a valid spec that can handle generic interfaces, and then we can define a Stan output format and a generic output reader. We need to stop hacking and start refactoring.
On Aug 17, 2015, at 3:58 PM, bgoodri notifications@github.com wrote:
Moving the adaptation stuff to the diagnostic file would allow me to unify the read_stan_csv() function for the MCMC and VB output.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-131991031.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-131998741.
— Reply to this email directly or view it on GitHub.
Comment by bgoodri Tuesday Aug 18, 2015 at 00:40 GMT
It is a hack to put non comma delimited output into a comma delimited file, particularly in the middle of the file. I agree the adaptation info should be put somewhere because hopefully soon we'll be able to input that. But at the moment, it is only useful as a diagnostic and not useful enough as a diagnostic to warrant reading a comma delimited file one character at a time.
On Mon, Aug 17, 2015 at 8:28 PM, Michael Betancourt < notifications@github.com> wrote:
That’s not a hack — that’s a sequential stream of the MCMC output as it was always designed.
On Aug 17, 2015, at 5:07 PM, bgoodri notifications@github.com wrote:
It reads VB output easier than it reads MCMC output because of the latter's original hack of putting the adaptation info behind a # in the middle of the file when the warmup is saved. If we remove that hack, then it would work just fine.
On Mon, Aug 17, 2015 at 8:00 PM, Michael Betancourt < notifications@github.com> wrote:
MCMC and VB/optimization. It was only ever written for MCMC output, everything since has been hacks.
On Aug 17, 2015, at 4:48 PM, bgoodri notifications@github.com wrote:
What do you mean read_stan_csv is not built for handling MCMC output? It has been handling the MCMC output produced by Stan since 2012. The only difference is that VB outputs different comments, which is annoying to deal with, and the reading is really slow if you specify the comment character as # because it has to read each line one character at a time.
On Mon, Aug 17, 2015 at 7:25 PM, Michael Betancourt < notifications@github.com> wrote:
The diagnostic file is meant for HMC diagnostics and to be entirely optional. read_stan_csv is not built for handling MCMC and VB output (let alone optimization output) and we shouldn’t be hacking it all together. We need a valid spec that can handle generic interfaces, and then we can define a Stan output format and a generic output reader. We need to stop hacking and start refactoring.
On Aug 17, 2015, at 3:58 PM, bgoodri notifications@github.com wrote:
Moving the adaptation stuff to the diagnostic file would allow me to unify the read_stan_csv() function for the MCMC and VB output.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub < https://github.com/stan-dev/stan/issues/1321#issuecomment-131991031>.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-131998741.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-132004035.
Comment by bob-carpenter Tuesday Aug 18, 2015 at 03:47 GMT
I'm with Ben on this one. We need to get information out of comments.
Michael --- is there something you're expecting others to do on this refactor? I'm not even sure what the scope of it is.
On Aug 17, 2015, at 8:40 PM, bgoodri notifications@github.com wrote:
It is a hack to put non comma delimited output into a comma delimited file, particularly in the middle of the file. I agree the adaptation info should be put somewhere because hopefully soon we'll be able to input that. But at the moment, it is only useful as a diagnostic and not useful enough as a diagnostic to warrant reading a comma delimited file one character at a time.
On Mon, Aug 17, 2015 at 8:28 PM, Michael Betancourt < notifications@github.com> wrote:
That’s not a hack — that’s a sequential stream of the MCMC output as it was always designed.
On Aug 17, 2015, at 5:07 PM, bgoodri notifications@github.com wrote:
It reads VB output easier than it reads MCMC output because of the latter's original hack of putting the adaptation info behind a # in the middle of the file when the warmup is saved. If we remove that hack, then it would work just fine.
On Mon, Aug 17, 2015 at 8:00 PM, Michael Betancourt < notifications@github.com> wrote:
MCMC and VB/optimization. It was only ever written for MCMC output, everything since has been hacks.
On Aug 17, 2015, at 4:48 PM, bgoodri notifications@github.com wrote:
What do you mean read_stan_csv is not built for handling MCMC output? It has been handling the MCMC output produced by Stan since 2012. The only difference is that VB outputs different comments, which is annoying to deal with, and the reading is really slow if you specify the comment character as # because it has to read each line one character at a time.
On Mon, Aug 17, 2015 at 7:25 PM, Michael Betancourt < notifications@github.com> wrote:
The diagnostic file is meant for HMC diagnostics and to be entirely optional. read_stan_csv is not built for handling MCMC and VB output (let alone optimization output) and we shouldn’t be hacking it all together. We need a valid spec that can handle generic interfaces, and then we can define a Stan output format and a generic output reader. We need to stop hacking and start refactoring.
On Aug 17, 2015, at 3:58 PM, bgoodri notifications@github.com wrote:
Moving the adaptation stuff to the diagnostic file would allow me to unify the read_stan_csv() function for the MCMC and VB output.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub < https://github.com/stan-dev/stan/issues/1321#issuecomment-131991031>.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-131998741.
— Reply to this email directly or view it on GitHub.
— Reply to this email directly or view it on GitHub https://github.com/stan-dev/stan/issues/1321#issuecomment-132004035.
— Reply to this email directly or view it on GitHub.
Comment by betanalpha Tuesday Aug 18, 2015 at 06:57 GMT
I'm with Ben on this one. We need to get information out of comments.
Michael --- is there something you're expecting others to do on this refactor? I'm not even sure what the scope of it is.
We need our output to define the state of the algorithm and any resulting output. The output is being refactored in the writers and the state will require a refactor of the arguments, nominally the last step in the overall refactoring effort. Until we do this we can�t define a proper spec, and we�ll just be hacking.
In particular the adaptation info will get refactored into key/value writes, which we can format however we want. But this won�t fix Ben�s complaint. Because CmdStan streams output the adaption info will also fall between warmup and sampling and we won�t have a continuous CSV. The only way to try to fix this would be to store variables in memory and then print like RStan does, but we know this doesn�t scale. Moreover, I much prefer a spec that follows the sequential order of the sampler.
Ultimately I just don�t buy the need for fast loading of CSVs. If speed is really a concern then we can build readers that do online calculations and avoid reading everything into memory in the first place.
Comment by bgoodri Tuesday Aug 18, 2015 at 11:13 GMT
Fast loading of CSVs is needed for a shinystan app that uploads a .csv file and creates an object that it can launch on while the meter is running. Moreover, extremely large .csv files are effectively unloadable even locally because it can take hours and thereby force users to do things like https://groups.google.com/d/msg/stan-users/urZeEMjjPHM/R0S41-hQbPMJ . Not-too-large .csv files can be effectively unloadable on a Network File System.
It is only your preference for a spec that follows the sequential order of the sampler that is preventing us today from patching src/cmdstan/command.hpp from
@@ -722,7 +722,6 @@ namespace stan {
if (adapt_engaged) {
dynamic_cast<stan::mcmc::base_adapter*>(sampler_ptr)
->disengage_adaptation();
- writer.write_adapt_finish(sampler_ptr);
}
to
@@ -740,6 +739,7 @@ namespace stan {
sampleDeltaT = static_cast<double>(end - start) / CLOCKS_PER_SEC;
writer.write_timing(warmDeltaT, sampleDeltaT);
+ writer.write_adapt_finish(sampler_ptr);
so that the resulting .csv file looks like
# stan_version_major = 2
# stan_version_minor = 7
# stan_version_patch = 0
# model = bernoulli_model
# method = sample (Default)
# sample
# num_samples = 1000 (Default)
# num_warmup = 1000 (Default)
# save_warmup = 1
# thin = 1 (Default)
# adapt
# engaged = 1 (Default)
# gamma = 0.050000000000000003 (Default)
# delta = 0.80000000000000004 (Default)
# kappa = 0.75 (Default)
# t0 = 10 (Default)
# init_buffer = 75 (Default)
# term_buffer = 50 (Default)
# window = 25 (Default)
# algorithm = hmc (Default)
# hmc
# engine = nuts (Default)
# nuts
# max_depth = 10 (Default)
# metric = diag_e (Default)
# stepsize = 1 (Default)
# stepsize_jitter = 0 (Default)
# id = 0 (Default)
# data
# file = bernoulli.data.R
# init = 2 (Default)
# random
# seed = 1198819904
# output
# file = output.csv (Default)
# diagnostic_file = (Default)
# refresh = 100 (Default)
lp__,accept_stat__,stepsize__,treedepth__,n_leapfrog__,n_divergent__,theta
-16.649,5.72792e-17,2,2,3,0,0.83286
... 1998 consecutive lines of comma-delimited numbers like ADVI does ...
-7.024,0.955167,0.956729,1,1,0,0.349523
# Elapsed Time: 0.022471 seconds (Warm-up)
# 0.022418 seconds (Sampling)
# 0.044889 seconds (Total)
# Adaptation terminated
# Step size = 0.956729
# Diagonal elements of inverse mass matrix:
# 0.409482
And that is not even that asequential because the (currently irrelevant, except as a 3rd-order diagnostic) stepsize and diagonal elements of the mass matrix are the same at the end of the sampling as they are at the end of the warmup. And that does not prevent progress toward the future where to become more than a 3rd-order diagonstic the step size and diagonal elements are presumably going to have to be written to a file in the same format as the data file or the init file so that it can be passed back. And it does not fail any unit test in cmdstan, although maybe some interface is currently expecting the adaptation info to be placed in the middle of the file. And it is a net decrease in hackishness. And it fixes an actual problem.
Comment by bob-carpenter Tuesday Aug 18, 2015 at 17:57 GMT
I think th elonger term plan is to put the adaptation info in a separate file altogether. I don't at all understand why it's a problem where it goes, but I don't like it as a comment mixed into an otherwise coherent CSV file.
On Aug 18, 2015, at 7:13 AM, bgoodri notifications@github.com wrote:
Fast loading of CSVs is needed for a shinystan app that uploads a .csv file and creates an object that it can launch on while the meter is running. Moreover, extremely large .csv files are effectively unloadable even locally because it can take hours and thereby force users to do things like https://groups.google.com/d/msg/stan-users/urZeEMjjPHM/R0S41-hQbPMJ . Not-too-large .csv files can be effectively unloadable on a Network File System.
It is only your preference for a spec that follows the sequential order of the sampler that is preventing us today from patching src/cmdstan/command.hpp from
@@ -722,7 +722,6 @@ namespace stan { if (adapt_engaged) { dynamic_caststan::mcmc::base_adapter*(sampler_ptr) ->disengage_adaptation();
- writer.write_adapt_finish(sampler_ptr); }
to
@@ -740,6 +739,7 @@ namespace stan { sampleDeltaT = static_cast
(end - start) / CLOCKS_PER_SEC; writer.write_timing(warmDeltaT, sampleDeltaT);
- writer.write_adapt_finish(sampler_ptr);
so that the resulting .csv file looks like
stan_version_major = 2
stan_version_minor = 7
stan_version_patch = 0
model = bernoulli_model
method = sample (Default)
sample
num_samples = 1000 (Default)
num_warmup = 1000 (Default)
save_warmup = 1
thin = 1 (Default)
adapt
engaged = 1 (Default)
gamma = 0.050000000000000003 (Default)
delta = 0.80000000000000004 (Default)
kappa = 0.75 (Default)
t0 = 10 (Default)
init_buffer = 75 (Default)
term_buffer = 50 (Default)
window = 25 (Default)
algorithm = hmc (Default)
hmc
engine = nuts (Default)
nuts
max_depth = 10 (Default)
metric = diag_e (Default)
stepsize = 1 (Default)
stepsize_jitter = 0 (Default)
id = 0 (Default)
data
file = bernoulli.data.R
init = 2 (Default)
random
seed = 1198819904
output
file = output.csv (Default)
diagnostic_file = (Default)
refresh = 100 (Default)
lp,accept_stat,stepsize,treedepth__,n_leapfrog,n_divergent__,theta -16.649,5.72792e-17,2,2,3,0,0.83286 ... 1998 consecutive lines of comma-delimited numbers like ADVI does ... -7.024,0.955167,0.956729,1,1,0,0.349523
Elapsed Time: 0.022471 seconds (Warm-up)
0.022418 seconds (Sampling)
0.044889 seconds (Total)
Adaptation terminated
Step size = 0.956729
Diagonal elements of inverse mass matrix:
0.409482
And that is not even that asequential because the (currently irrelevant, except as a 3rd-order diagnostic) stepsize and diagonal elements of the mass matrix are the same at the end of the sampling as they are at the end of the warmup. And that does not prevent progress toward the future where to become more than a 3rd-order diagonstic the step size and diagonal elements are presumably going to have to be written to a file in the same format as the data file or the init file so that it can be passed back. And it does not fail any unit test in cmdstan, although maybe some interface is currently expecting the adaptation info to be placed in the middle of the file. And it is a net decrease in hackishness. And it fixes an actual problem.
— Reply to this email directly or view it on GitHub.
Issue by bgoodri Sunday Mar 01, 2015 at 21:07 GMT Originally opened as https://github.com/stan-dev/stan/issues/1321
The current output of the .csv files when you include the warmup period looks like
This format is inefficient in time and RAM to read (e.g. into R) because it has to scan each line as text for the leading pound sign. Also, it does not play nicely with the optimized
fread
function in the data.table package that was apparently used by https://groups.google.com/d/msg/stan-users/urZeEMjjPHM/R0S41-hQbPMJI think it would be better to
Thus, the format would look like
Does anything depend on the absence of blank lines or the presence of the adaptation info being in between the warmup iterations and the sampled iterations?