multivariable distributions - T and P separate or together

abigailsnyder commented 4 years ago

@claudiatebaldi @kdorheim While working through the code in more depth for this enhancement (https://github.com/JGCRI/an2month/issues/16), I've combed line by line through the nested functions in data_raw/L3_fit_dirichlet_params.R and data_raw/jobrun.zsh. As far as I can tell, the code estimates the parameters of one multivariate beta (Dirichlet) distribution for the temperature data, and a separate set of parameters for the precipitation data.
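For anyone following along, here is a minimal sketch of what "a separate set of parameters for each variable" means in practice. It is not the code in data_raw/L3_fit_dirichlet_params.R: it is written in Python rather than R, the function name `fit_dirichlet_moments` is hypothetical, the fractions are made up, and it uses a simple moment-matching estimator, which may differ from whatever fitting method the package actually uses.

```python
from statistics import fmean, pvariance


def fit_dirichlet_moments(rows):
    """Moment-matching estimate of Dirichlet parameters (illustrative only).

    rows: list of observations, each a list of fractions summing to 1,
    e.g. the monthly fractions of one year's annual value.
    """
    k = len(rows[0])
    columns = [[row[i] for row in rows] for i in range(k)]
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    # For a Dirichlet, Var[x_i] = m_i * (1 - m_i) / (alpha_0 + 1), so every
    # component yields its own estimate of the total alpha_0; average them.
    alpha0_estimates = [
        m * (1 - m) / v - 1 for m, v in zip(means, variances) if v > 0
    ]
    alpha0 = fmean(alpha0_estimates)
    return [m * alpha0 for m in means]


# Made-up fractions with 3 "months" per year for brevity (assumption: the
# real data would have 12 columns per variable).
t_fractions = [
    [0.20, 0.30, 0.50],
    [0.25, 0.35, 0.40],
    [0.22, 0.28, 0.50],
    [0.18, 0.36, 0.46],
]
p_fractions = [
    [0.50, 0.30, 0.20],
    [0.40, 0.35, 0.25],
    [0.55, 0.25, 0.20],
    [0.45, 0.30, 0.25],
]

# Separate treatment: one independent fit per variable.
alpha_t = fit_dirichlet_moments(t_fractions)
alpha_p = fit_dirichlet_moments(p_fractions)
```

With two independent fits like this, nothing in the fitted parameters ties the T fractions to the P fractions.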

I didn't catch this when I was first learning the an2month package, I think because of how the functions are nested. Treating T and P separately also differs from the very early notes I contributed on what the sampling should look like (around Dec 2018), and I wasn't involved in the actual work after that. Then so many issues came up with how fldgen was being called in the pipeline that I didn't return to this until last week/this week.

So do we want to keep T and P separate, as currently implemented, or do we want to estimate all 24 parameters together (which is what I initially thought was happening)? Also, any thoughts on continuing to use a multivariate beta distribution?

abigailsnyder commented 4 years ago

Per @claudiatebaldi, the expectation is that T and P are jointly estimated and jointly sampled.

@abigailsnyder will:

[ ] update the code in data_raw to estimate jointly: 24 parameters, giving a joint multivariate beta distribution of the T and P fractions. Also add more documentation to those functions.
[ ] refit the models.
[ ] open a PR

Then go back into the monthly_downscaling code and update the sampling to be joint, and add the options outlined in https://github.com/JGCRI/an2month/issues/16
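To make the separate-versus-joint distinction concrete, here is a rough sketch of the two sampling layouts. It is an illustration only, in Python rather than the package's R: the helper `sample_dirichlet` and the parameter values are hypothetical placeholders, not fitted values or functions from an2month or cassandra.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)

# Hypothetical parameters, 12 per variable (placeholder values, not fitted).
alpha_t = [8.0] * 12
alpha_p = [3.0] * 12

# Separate treatment: two independent 12-component draws,
# each of which sums to 1 on its own.
t_separate = sample_dirichlet(alpha_t, rng)
p_separate = sample_dirichlet(alpha_p, rng)

# Joint treatment: a single 24-component draw, split into a T block
# and a P block afterwards.
joint = sample_dirichlet(alpha_t + alpha_p, rng)
t_joint, p_joint = joint[:12], joint[12:]
```

One thing this layout makes visible: in the joint draw all 24 components sum to 1 together, so the T block and the P block no longer each sum to 1. How the joint fractions should map back to per-variable monthly fractions is not settled in this thread, so the sketch leaves that step out.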

abigailsnyder commented 4 years ago

In terms of updating the sampling to be joint, it looks like the separate sampling for each variable happens in the cassandra components code: https://github.com/JGCRI/cassandra/blob/master/cassandra/components.py (lines 964-977).

That explains why it's harder to see from the R monthly_downscaling sampling code than from the data_raw/... training code that the variables are being treated separately.

So the R code will have to be updated for the sampling, and then the cassandra code will have to be updated as well. FYI @crvernon

JGCRI / an2month

multivariable distributions - T and P separate or together #17