Transition plan tasks - Githubissues

rplzzz commented 4 years ago

[x] Find where metrosamp tuning parameters for Protocol 48 are recorded.
[x] Document termination criteria.
[x] Document use of multiple Markov chains for each protocol.
[x] Document procedure for concatenating original batch and continuation batch runs into a single dataset that can be used in the analysis functions.
[x] Document workflow for running the MCMC runs.
[x] Archive the old MCMC output data.
[x] Determine whether any of the material on the pkg-data-update branch is still needed.
- I originally created this branch when I thought the PCA code was going to have to be updated to work with the new data. It's not clear that anything on it is still required.
- Update: I don't see anything in the branch that we still need. I'll leve the branch in place in case there is something I missed, but I expect you won't ever have cause to look at it.

rplzzz commented 4 years ago

The MCMC procedure is described in detail in the aptly titled "Monte Carlo Procedure" subsection. On the subject of termination criteria, we say that we ran the MCMC until the Gelman-Rubin statistic (Rhat) was <= 1.1. We also note that for this protocol that happened in the initial batch (i.e., no continuation was necessary).

rplzzz commented 4 years ago

The initial guess and tuned proposal distribution parameters for the MCMC are set automatically in the mc_run_conc and mc_run_emiss functions. The initial guess is the same for all protocols, while the scale factors are protocol dependent.

Here is the logic for the concentration-driven case: https://github.com/kdorheim/hectorcal/blob/a9d49d04c66b9b225d1b4140555f4da4a344edfc/inst/scripts/mc-runs-conc.R#L76-L154 The basic parameters set on line 84 are what is used in Protocol 48, which is the focus of the paper. It is possible to override these defaults by restarting from a saved output file with different values saved in it, but we didn't do that in any of the production runs. Note also that for runs using mean calibration without PCA we specify a covariance matrix, instead of just independent scale factors. That's because in those protocols the posterior distribution is highly correlated, and you just won't get efficient sampling (i.e., lots of rejections unless you take very small steps) without having a correlated proposal distribution.

You can also find this information after the fact by looking at the metrosamp object output from the calculation.

> mcrc48 <- readRDS('analysis/paper2/mcmc-output/conc/hectorcal-25000-48-mcrslt.rds')
> str(mcrc48)
List of 5
 $ samples: num [1:25000, 1:4] 3 3 3 3 2.63 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:4] "S" "diff" "alpha" "volscl"
 $ samplp : num [1:25000] -23.1 -23.1 -23.1 -23.1 -23 ...
 $ accept : num 0.21
 $ plast  : Named num [1:4] 2.524 3.941 0.501 1.231
  ..- attr(*, "names")= chr [1:4] "S" "diff" "alpha" "volscl"
 $ scale  : Named num [1:4] 0.75 1.25 0.5 0.5
  ..- attr(*, "names")= chr [1:4] "S" "diff" "alpha" "volscl"
 - attr(*, "class")= chr [1:2] "metrosamp" "list"
> mcrc48$scale
     S   diff  alpha volscl 
  0.75   1.25   0.50   0.50

rplzzz commented 4 years ago

It turns out the old MCMC output, at least for the concentration-driven runs, is included in the repository, in analysis/paper2/mcmc-output/conc.

Note, however, that in paper2-figs.R we load results from analysis/mcmc/conc/primary. These files are dated 21 Aug. (and are not checked into the repository), while the ones I referred to above are dated 21 Oct, so I believe paper2-figs.R should be updated to point at the more recent files. However, all of this will be moot when you do the reruns with the new dataset anyhow.

Update: It appears that the files in the repository are the concatenated primary and continuation runs from the directory referred to in the analysis script (see 5e76c37) I have archived the data from analysis/mcmc/{conc,emiss} on PIC as /pic/projects/GCAM/rpl/hectorcal-output-archive.tgz.

rplzzz commented 4 years ago

Workflow

How to run the MCMC runs on PIC is described in inst/scripts/readme.md.

We normally start with a 10,000 iteration warm-up run. The output from these runs is not included in the final results. The warm-up serves two purposes. The first is that it allows the Markov chains to disperse a bit from their common starting location. If you don't do this, then you bias the probability density a little bit high in the vicinity of the initial guess. The second purpose is that we can look at the diagnostics from the warm-up run and ensure that the sampling efficiency is what we expected.

The initial batch is run using the output of the warm-up run as a restart file. This is done by setting the hectorcalRESTART environment variable before submitting the batch script (batch scripts run with sbatch inherit environment variables that are set in the shell they are launched from).

For most protocols, the initial batch will be sufficient, but in cases where a continuation is needed, you run the same way we ran the initial batch, but in this case you will set hectorcalRESTART to point at the output from the initial batch, rather than from the warm-up batch.

rplzzz commented 4 years ago

MCMC Results Structure

Results from MCMC runs are read in using the proc_mc_rslts function. It takes a whole directory of output files and produces a list of three elements:

runstats: For each protocol, a list of summary statistics for that protocol's runs (recall that there are 16 independent Markov chains for each protocol -- these statistics apply to the collection). Mostly the results in this slot are there to allow you to take a quick look at how the runs performed and to diagnose whether you need to set up a continuation run. The statistics calculated are:
- accept: The fraction of all proposal steps that were accepted.
- neff: The number of effective samples, after accounting for autocorrelation. There will be a value for each parameter.
- rhat: The Gelman-Rubin statistic (called the "potential scale reduction factor" in the formatted output). There will be a value and confidence interval for each model parameter, plus an overall value called the "multivariate psrf".

mcobjs: The metrosamp objects generated by the MCMC runs. These are indexed by protocol. Within each protocol there is a list of metrosamp objects indexed by the name of the file they were read from. Each of these is one of the 16 independent Markov chains that was run for that protocol. E.g.:

> names(mcruns_conc$mcobjs$`48`)
[1] "analysis/mcmc/conc/primary/hectorcal-conc-100000-48-mcrslt.rds"
[2] "analysis/mcmc/conc/primary/hectorcal-conc-100000-49-mcrslt.rds"
[3] "analysis/mcmc/conc/primary/hectorcal-conc-100000-50-mcrslt.rds"
...
[16] "analysis/mcmc/conc/primary/hectorcal-conc-100000-63-mcrslt.rds"

codaobjs: mcmc.list objects from the CODA package. These contain the same information as the mcobjs; they're just reformatted to be used with some of the analysis functions in CODA.

Most of the analysis functions take a list of metrosamp objects corresponding to the collection of Markov chains run for a single protocol. For example, here's a line from the analysis code that creates a pairplot for Protocol 48: https://github.com/kdorheim/hectorcal/blob/a9d49d04c66b9b225d1b4140555f4da4a344edfc/analysis/paper2/paper2-figs.R#L18 It creates the following plot, which is based on all 16 of the runs listed above:

Concatenating Runs

If you have to do a continuation run, you will concatenate it with the concat_runs function. This function takes two structures returned from proc_mc_rslts, as described above. The first will be the base (i.e., initial) runs, and the second will be the continuation runs. It is not necessary to have continuation runs for all protocol, and generally you won't. Even the base set of runs we do, with 16 Markov chains and 100,000 iterations is overkill for most of the protocols.

The structure returned will be organized just like one of the ones returned by proc_mc_results. The protocols for which there were continuation runs will have the base and continuation runs concatenated into one long set of results, and the runstats fields for these protocols will have been recalculated using the concatenated results. Protocols for which there weren't continuations will be copied unmodified into the return object.

Other packages

Reading back over this, I realized that I just assumed familiarity with metrosamp and CODA. In case you haven't used them, metrosamp is one of ours. You can find it at JGCRI/metrosamp. Its main function is to do the MCMC sampling using the Metropolis algorithm, but it has some rudimentary analysis functions baked in. Note that although it purports to provide batch sampling, that functionality is buggy (JGCRI/metrosamp#1), so don't try to use that.

CODA is available on CRAN. It provides some more advanced analysis functions, such as the N_eff and Rhat statistics.

rplzzz commented 4 years ago

Run ID numbers and multiple Markov chains

Each run has an ID number from 0-127 (or from 128-255, which produces the same runs, but with extra debugging output). The meaning of the the run ID is explained in the documentation to encode_runid and decode_runid. Briefly, the first four bits in the binary representation of the run ID serve as flags that turn on or off various features of the calibration protocol. Thus, run 48 (b00110000) has the first two flags (debug and mean calibration) cleared, and the second two (PCA and heat flux) set.

The last four bits don't affect the calibration protocol. Instead, they allow us to run multiple independent Markov chains for each protocol by providing a serial number for each chain. This serial number is used to select a different random number seed for each chain, which produces an independent chain. So, for each protocol, there will be 16 runs that all provide results for that protocol. The protocol itself is named for the lowest run ID that implements that protocol. For example, run IDs 48-63 all implement the same protocol, which we refer to as Protocol 48. The proc_mc_rslts function reads all of the runs in a directory and groups them by protocol for analysis.

rplzzz commented 4 years ago

@kdorheim I believe this covers everything we talked about. Please let me know if there is anything I missed.

kdorheim commented 4 years ago

@rplzzz thanks this was very helpful and I am pretty sure covers more than we discussed last week.

kdorheim commented 4 years ago

@claudiatebaldi Robert documented the MCMC code here and the data lives on pic at /pic/projects/GCAM/rpl/hectorcal-output-archive.tgz. If you have any questions please let me know. If you are unable to access the /pic/projects/GCAM/rpl directory, email pic support and they will be able to give you access.

claudiatebaldi commented 4 years ago

Perfect, thank you!

From: "Kalyn R. Dorheim" notifications@github.com Reply-To: kdorheim/hectorcal reply@reply.github.com Date: Monday, May 4, 2020 at 5:03 PM To: kdorheim/hectorcal hectorcal@noreply.github.com Cc: "Tebaldi, Claudia" claudia.tebaldi@pnnl.gov, Mention mention@noreply.github.com Subject: Re: [kdorheim/hectorcal] Transition plan tasks (#33)

@claudiatebaldihttps://protect2.fireeye.com/v1/url?k=f2ef23bb-ae5a1c02-f2ef09ae-0cc47adc5fce-ded7261ab818653d&q=1&e=a3ec4f7c-1b59-494b-b680-85f763bcaef7&u=https%3A%2F%2Fgithub.com%2Fclaudiatebaldi Robert documented the MCMC code here and the data lives on pic at /pic/projects/GCAM/rpl/hectorcal-output-archive.tgz. If you have any questions please let me know. If you are unable to access the /pic/projects/GCAM/rpl directory, email pic support and they will be able to give you access.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://protect2.fireeye.com/v1/url?k=5dfcfa25-0149c59c-5dfcd030-0cc47adc5fce-2cc6ead8c44560b6&q=1&e=a3ec4f7c-1b59-494b-b680-85f763bcaef7&u=https%3A%2F%2Fgithub.com%2Fkdorheim%2Fhectorcal%2Fissues%2F33%23issuecomment-623704146, or unsubscribehttps://protect2.fireeye.com/v1/url?k=f3746d7c-afc152c5-f3744769-0cc47adc5fce-59b883f8cdbe7738&q=1&e=a3ec4f7c-1b59-494b-b680-85f763bcaef7&u=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAFZSZZRJHLOABD5LJQZKTSDRP4UPZANCNFSM4LCXTSUQ.

claudiatebaldi / hectorcal

Transition plan tasks #33

Workflow

MCMC Results Structure

Concatenating Runs

Other packages

Run ID numbers and multiple Markov chains