StanJulia / StanSample.jl

WIP: Wrapper package for the sample method in Stan's cmdstan executable.
MIT License
18 stars 4 forks source link

Simplify API for seed #45

Closed itsdfish closed 2 years ago

itsdfish commented 3 years ago

Hi Rob,

I was wondering whether it would be possible to simplify the API for setting a random seed. Currently, it requires

SampleModel(
  ...;
  seed = StanSample.StanBase.RandomSeed(seed),
  ...
)

I'm there are a few alternatives depending on the reason the seed is wrapped in RandomSeed. If RandomSeed is not necessary, it would be treated as an integer with a default value. Alternatively, we could set a default keyword value in the constructor and wrap that integer in RandomSeed inside the constructor. Would you be amenable to that?

It seems like a similar approach could be used for the other keywords, but seed might benefit the most from a simpler API.

itsdfish commented 3 years ago

You also noted in a previous issue that n_chains::Vector{Int} could be changed to n_chains::Int. I could help with any of these changes if it is not too involved.

goedman commented 3 years ago

Very interesting question! I (forever and ever!) kind of parsed the Stan Language structure but likely it is time to review that. I'll do a couple of tests. Like your chains question, not sure if we can update the SampleModel fields directly, but I'll try it.

goedman commented 3 years ago

So I wonder if we should all make them keyword arguments with defaults. I think that would be a significant improvement:

sm = SampleModel("m4_1s", stan4_1; seed=-1, nsamples=1000, delta=0.85, nchains=4, ...)
itsdfish commented 3 years ago

Yeah. I think that would be the best and simplest solution.

On Wed, Oct 13, 2021, 7:48 PM Rob J Goedman @.***> wrote:

So I wonder if we should all make them keyword arguments with defaults. I think that would be a significant improvement:

sm = SampleModel("m4_1s", stan4_1; seed=-1, nsamples=1000, delta=0.85, nchains=4, ...)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/StanJulia/StanSample.jl/issues/45#issuecomment-942803388, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJ2TAGBAILB4NFJR7EBBPDUGYLERANCNFSM5F6JBV2Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

goedman commented 3 years ago

I'll have a look over the next couple of days what such a change entails.

This is code that that has been around for 10 years so I want to be careful. It also relates to my remark here.

In your example above, is your objection agains the use of RandomSeed() or the fact that RandomSeed is not exported and you have to use StanSample.StanBase.RandomSeed(seed)? Just wondering if this is a possible quick fix until the bigger update becomes available. Or, as you suggest, make the type a union and deal with it in the SampleModel call.

itsdfish commented 3 years ago

I understand. It would be a big change. Fortunately, it would only require a few minor changes in your statistical rethinking code because you typically rely on default values.

My objection concerns both issues. Currently, changing the seed requires typing out a very verbose command to make a simple change. Of course, this could be mitigated to some extent by exporting RandomSeed. However, I would regard RandomSeed as an implementation detail that should be hidden from the user. I would only require users to wrap settings in structs if it makes the code more general through dispatch. Otherwise, I would suggest passing primitive types as keywords.

Depending on how you designed the package, I think the same might apply to Sample(). Wrapping the settings of the sampler in a struct imposes additional mental burden on the user. I would consider using simple keywords instead. However, I think there could be some reasons to keep those bundled inside Sample. If you have multiple sampler types on which you are dispatching I think it makes sense to keep certain settings in Sample and pass it to SampleModel because it makes the code more flexible and extendable. Similarly, if you think Stan might add new samplers in the future, using Sample is a good choice. I cannot imagine this being the case for seed however.

Here are my recommendations. If you are not using Sample for dispatch, consider the following:

function SampleModel(
    name::AbstractString,
    model::AbstractString;
    n_chains = 4, # I think n_chains should be a keyword
    seed = -1,
    init = StanBase.Init(),
    output = StanBase.Output(),
    tmpdir = mktempdir(),
    summary::Bool = true,
    printsummary::Bool = false,
    # consider  destructuring Sample() if you are not using it for dispatch
    num_samples,
    num_warmup,
    # and so forth
  )

  # optionally wrap keywords in structs as needed for internal use

end

If you are using Sample to make the code more flexible, then I would recommend this:

function SampleModel(
    name::AbstractString,
    model::AbstractString;
    n_chains = 4, # I think n_chains should be a keyword
    seed = -1,
    init = StanBase.Init(),
    output = StanBase.Output(),
    tmpdir = mktempdir(),
    summary::Bool = true,
    printsummary::Bool = false,
    # keep sample if you are using it for dispatch because it makes the code more general
    method = Sample()
  )

  # optionally wrap seed in RandomSeed if needed for internal use
end

In both cases, n_chains and seed are integer keywords.

goedman commented 3 years ago

Thanks for this. Have been mulling over this and I now think I understand your suggestions.

Currently, in SampleModel(), Stan's Sample() method is not used for (Julia) dispatching. Sample() reflects Stan's way to call sampling functionality in the single cmdstan binary for drawing values. And similarly Optimize() for map estimates, etc. Way back I took that route because I had no idea how stable/variable these interfaces to cmdstan would be.

My plan is to revisit this part of your suggestion after addressing the seed::Int question. This will then also allow a new keyword stan_num_threads in the future and make the overall cmdstan stages hopefully clearer (compile, sample, read .csv/.json result files).

In the call to stan_sample I think I can easily allow a simple Int value and wrap it. I need to reread Stan's docs to refresh what happens if the seed !== -1 in the different chains. What is your use case? Do you set a seed and run a single chain? Is the effect different if you call Random.seed!(...) before calling stan_sample(...)?

On another topic, over the years (I think :-) you have been the only one asking questions about dealing with saved warm-ups (even Stan's stansummary binary fails in that case). Do you still use that? The reason I ask is because if say n_chains becomes a keyword I can query KeyedArray based chains for the number of chains. Similar as in MCMCChains. But warm-up samples is a bit more involved.

Again, thanks for your help!

goedman commented 3 years ago

Hi Chris, just pushed StanSample.jl v4.3.0 which allows kwargs seed and n_chains, e.g.

rc = stan_sample(model; data, seed=123, n_chains=5)
itsdfish commented 3 years ago

Hi Rob,

These changes look good!

To answer your questions above, I usually run 4 chains and set a seed to make it reproducible. If I am doing a parameter recovery simulation to validate the model, I generate simulated data in Julia and set the same seed in julia via Random.seed!. From what I can tell, it is sufficient to reproduce the posterior samples by passing a single seed value (I cannot think of a case where one would need to specify the seed of each chain). I typically do not save warmup samples, unless I need them to diagnose a model that is not converging. So I'm not sure whether it is worth your trouble.

Now that I understand your design better, it seems like you could destructure Sample into keyword arguments for SampleModel as I originally proposed. This would greatly simplify the API, much like changing seed to an integer keyword. There are different ways to use dispatch if you need to accommodate a new sample in the future. For example, you could do something like this:

abstract type SampleModel end

struct NUTSModel <: SampleModel
...
end

struct NewModel <: SampleModel
...
end

function stan_sample(stan_model::SampleModel; kwargs...)
...
end

This would allow you the flexibility to add new samplers and also destructure the fields of Sample into keywords for a simpler API. So I think I was incorrect earlier when I thought there was a tradeoff between dispatch and simplifying the API.

goedman commented 3 years ago

Hi Chris,

In StanSample.jl v5 I will destructure the Stan structs and switch to cmdstan 2-28-0.

I do something similar to what you describe above. I have

abstract type CmdStanModels end

and

mutable struct SampleModel <: CmdStanModels
    name::AbstractString; # Name of the Stan program
    model::AbstractString; # Stan language model program
...
   method::Sample
end

mutable struct OptimizeModel <: CmdStanModels
   @shared_fields_stanmodels # This usage of a macro will be removed in v5
   method::Optimize
end

... etc.

but opted to use stan_sample(), stan_optimize(), stan_variational() etc. on user level. Currently they all call stan_sample() in StanBase.jl (should have been called stan_run()). But that is going to change in v5, as I did yesterday for StanSample.jl v4.3.0, together with the destructuring and enabling the use of C++ threads and map_rect etc.

itsdfish commented 3 years ago

Awesome! I look forward to the next version.

goedman commented 2 years ago

Version 5.0 has been released and should fix this issue.

itsdfish commented 2 years ago

I just upgraded. Very nice!