torfjelde opened 3 years ago
would it be possible to make proper stochastic algorithms, i.e. do data-subsampling?
I thought about this when I worked on the implementation, but I didn't include it here since it was not part of the old implementation and it requires some additional thought and discussion, which I assumed would be better to have in a separate PR or issue.
Mainly I thought about two approaches, both based on a special context and some special `observe` implementations to handle multiple observations in one `~` expression. One could try to extract the number of observations by running the fully sampled model once and then use the context to only sample one index (or a batch of indices) in every run and only include the corresponding observations. However, I assume that stochastic control flow would break this approach. As an alternative, one could instead specify the probability of an observation being included (similar to dropout) and just decide randomly in each `observe` call if a (subset of) observation(s) will be included. This does not require a fixed number of observations and should work for all kinds of models, but there is a non-zero probability that all observations would be dropped - in this case it is not possible to normalize the (gradient of the) log probability with the number of considered observations.
From https://github.com/TuringLang/Turing.jl/pull/1502#issuecomment-751298197 by @devmotion
One could try to extract the number of observations by running the fully sampled model once and then use the context to only sample one index (or a batch of indices) in every run and only include the corresponding observations. However, I assume that stochastic control flow would break this approach.
Hmm, yeah this could work pretty well, as you said, for static models. I was thinking of extracting the data-containers, and then just doing inplace mutations of the container. But just sampling the symbols directly is a better idea.
As an alternative, one could instead specify the probability of an observation to be included (similar to dropout) and just decide randomly in each observe call if a (subset of) observation(s) will be included
Surely this could fall under a more general "filter `VarName`" context which simply has a filtering function specified by the user? This filter could be random, deterministic, etc.
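A minimal sketch of such a filtering context (hypothetical names, assuming the predicate receives the `VarName`):

```julia
# Hypothetical filtering context: the user-supplied predicate decides which
# observations enter the log-likelihood.
struct FilterContext{F}
    keep::F  # vn -> Bool
end

# Random filter: keep roughly 10% of observations.
random_ctx = FilterContext(vn -> rand() < 0.1)

# Deterministic filter: keep only observations attached to `y`.
deterministic_ctx = FilterContext(vn -> startswith(string(vn), "y"))
```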
One issue with both approaches: neither handles the case where the full data set is too large to fit in memory, and ideally you'd lazily load the batch. An alternative that might handle this is a "data-replacement" context which would essentially function as a `merge`, i.e. if the `VarName` is present in the context, then use the value from the context, otherwise use the value "encoded" in the model. This would still require you to be able to hold `2 * batch_size` in memory, but it's at least way better than the full dataset.
If combined with a filtering context as described above, we could also support dynamically sized data as long as the largest possible batch could be kept in memory.
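The "data-replacement" idea could look something like this (a sketch with made-up names; the real thing would hook into `observe`):

```julia
# Values present in the context override the values encoded in the model,
# analogous to `merge` on dictionaries.
struct ReplacementContext
    values::Dict{Symbol,Any}
end

resolve(ctx::ReplacementContext, name::Symbol, model_value) =
    get(ctx.values, name, model_value)

ctx = ReplacementContext(Dict(:x => randn(10)))  # current batch for `x`
resolve(ctx, :x, zeros(100))  # uses the 10-element batch from the context
resolve(ctx, :y, zeros(5))    # falls back to the model's own value
```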
As of right now, this would require re-instantiating the model between every call to `AbstractMCMC.step`.
This is only possible if observations are provided as arguments to the model which is not guaranteed in general and not even the case for the standard models that we use in our tests.
By the way, at this stage I think this should be addressed in Turing rather than in DynamicPPL by defining a custom context for SGHMC and SGLD, similar to the custom context for MLE and MAP optimization.
This is only possible if observations are provided as arguments to the model which is not guaranteed in general and not even the case for the standard models that we use in our tests.
That's a good point! But I also don't think that's too stringent a requirement for being able to use data-subsampling. It would be very natural to just have the user provide a method with a signature identical to the model's which would do in-place mutation of the arguments.
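E.g. something like this entirely hypothetical companion function, mirroring a model that takes its observations as an argument:

```julia
# Same argument as the model; mutates it in place to hold the next batch,
# so no model re-instantiation (or reallocation) is needed.
function load_batch!(x::Vector{Float64}, batch::AbstractVector{<:Real})
    x .= batch
    return x
end

x = zeros(10)
load_batch!(x, randn(10))  # `x` now holds the new batch
```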
By the way, at this stage I think this should be addressed in Turing rather than in DynamicPPL by defining a custom context for SGHMC and SGLD, similar to the custom context for MLE and MAP optimization.
Hmm, I guess it might be better to put it there. You mean "for now" while we are in a bit of an experimentation phase? Btw we already have a `MiniBatchContext` in DynamicPPL, which seems like a setup for having something like this.
One issue with both approaches: neither handles the case where the full data set is too large to fit in memory, and ideally you'd lazily load the batch. An alternative that might handle this is a "data-replacement" context which would essentially function as a `merge`, i.e. if the `VarName` is present in the context, then use the value from the context, otherwise use the value "encoded" in the model. This would still require you to be able to hold `2 * batch_size` in memory, but it's at least way better than the full dataset.
I think it would be easier to just use lazy datastructures for the observations (an array that loads data only if accessed via `getindex`) in such a case instead of handling this case in DynamicPPL or Turing.
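For illustration, a minimal lazy array that only materializes elements on `getindex` (a sketch; in practice one would reach for e.g. HDF5 with mmap or a similar package):

```julia
# Elements are produced by `load` only when indexed, so only the accessed
# batch ever lives in memory.
struct LazyObservations{F} <: AbstractVector{Float64}
    load::F   # i -> Float64, e.g. reading from disk or a database
    len::Int
end

Base.size(x::LazyObservations) = (x.len,)
Base.getindex(x::LazyObservations, i::Int) = x.load(i)

obs = LazyObservations(i -> Float64(i)^2, 100)
obs[3]        # computes only element 3
obs[11:20]    # materializes just this batch
```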
Surely this could fall under a more general "filter VarName" context which simply as a filtering function specified by the user? This filter could be random, deterministic, etc.
In some sense, yes. But we also want to subsample hardcoded observations which don't have a `VarName` expression, and we want to subsample within `VarName`s, i.e. e.g. only include 1 out of 10 observations present in variable `y`.
I think it would be easier to just use lazy datastructures for the observations (an array that loads data only if accessed via getindex) in such a case instead of handling this case in DynamicPPL or Turing.
That's a great idea! Though it seems like it might require a bit of work.
Hmm, I guess it might be better to put it there. You mean "for now" while we are in a bit of an experimentation phase?
Yes, I think we should come up with a good design that works for SGHMC and SGLD and test it properly in Turing. If needed it can be moved to DynamicPPL (or somewhere else) and maybe be generalized. It is definitely easier to iterate and experiment, if we do not have to sync DynamicPPL and Turing initially.
Btw we already have a MiniBatchContext in DynamicPPL, which seems like a setup for having something like this.
IIRC this one is quite different, it just rescales the resulting log probabilities, doesn't it?
Yes, I think we should come up with a good design that works for SGHMC and SGLD and test it properly in Turing. If needed it can be moved to DynamicPPL (or somewhere else) and maybe be generalized. It is definitely easier to iterate and experiment, if we do not have to sync DynamicPPL and Turing initially.
Agree :+1:
IIRC this one is quite different, it just rescales the resulting log probabilities, doesn't it?
I just meant that the motivation behind `MiniBatchContext` and whatever we're doing has a big overlap :) But yes, it's of course much simpler.
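For reference, `MiniBatchContext` just wraps another context and rescales the log-likelihood; if I read the code correctly, usage is roughly the following (constructor keywords may differ across DynamicPPL versions):

```julia
using DynamicPPL

# Scales each observe contribution by npoints / batch_size, so a batch of 10
# observations stands in for the full 100.
ctx = MiniBatchContext(DefaultContext(); batch_size = 10, npoints = 100)
```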
In some sense, yes. But we also want to subsample hardcoded observations which don't have a VarName expression, and we want to subsample VarNames, i.e. e.g. only include 1 out of 10 observations present in variable y.
True, but I meant a function rather than a lookup, so that this could also be included in the filtering context, where the `filter` method simply returns `true` with some probability.
If I remember correctly, the way that I accomplished this in TensorFlow Probability was to use `tf.Dataset` in the log probability function. `tf.Dataset` is a general data structure that is able to lazily load sharded data on demand and apply preprocessing operations on the fly.
I'm not sure what the equivalent of this is in the Julia ecosystem (I'm still relatively new here), but I think this would be the ideal way to go; i.e. use a lazy/generic data structure in your `@model` function and then just take random batches in whatever manner is appropriate for the problem.
I highly recommend checking out the `tf.Dataset` API if you guys aren't familiar with it. Maybe Turing could provide a similar interface that would be able to handle multiple different kinds of backing stores (i.e. `AbstractArray`, `DistributedArray`, something from LazyArrays.jl, or just a plain old generator function).
My main point is that I think you should avoid building the actual subsampling procedure into the Turing sampler, model context, etc., as I think @devmotion suggested in his original two proposals (although perhaps I misunderstood). Different kinds of problems require different kinds of subsampling, so it's best to be as flexible as possible for the modeler and leave this up to them.
We need some structure or algorithm and can't leave this only up to the user. In contrast to when you implement the log probability function yourself, the Turing `@model` does not provide a generic way to perform subsampling, since there is no clear distinction between data/observations and latent variables/assumptions. Observations don't have to be arguments of the model, and even if you specify observations as arguments to the model, they are treated as latent variables if they are `missing` or e.g. a vector of `missing`s.
Of course, it is possible for users to explicitly subsample observations in the model definition, but then the model can't be used with other samplers and can't be used for evaluating the actual log probabilities without stochastic subsampling.
Contexts provide a way to inject the subsampling procedure in the model execution only when desired, and regardless of how the observations and latent variables are specified.
That being said, I definitely agree that we should make as few assumptions as possible and let the modelers decide which (possibly lazy) datastructures they use for specifying the observations. This was the intention behind my comment above:
I think it would be easier to just use lazy datastructures for the observations (an array that loads data only if accessed via getindex) in such a case instead of handling this case in DynamicPPL or Turing.
However, this is orthogonal to the implementation of subsampling with contexts. I assume also that there are already implementations and packages for lazy data loading (e.g. JuliaDB or mmap with HDF5?), so ideally users should just be able to use them and I don't think Turing should focus on this part.
I'm afraid I can't comment further without a better understanding of Contexts. Is there documentation for this or do I need to just read the source code?
There's some explanation here: https://turing.ml/dev/docs/for-developers/compiler
The main thing is just that one can provide a context when executing a model, and this context is forwarded to all `observe` and `assume` calls (i.e., the function calls to which `~` expressions are rewritten). It is then possible to implement `observe` and `assume` (or some other functions they call) for a context to achieve a different behaviour when the model is executed. E.g. `PriorContext` only accumulates the log probability in `assume` calls, `LikelihoodContext` only in `observe` calls, and `DefaultContext` in both.
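Schematically (a simplification, not the actual DynamicPPL source):

```julia
# Sketch of which calls accumulate log probability under each context.
assume(::DefaultContext, dist, x)     = logpdf(dist, x)
observe(::DefaultContext, dist, x)    = logpdf(dist, x)

assume(::PriorContext, dist, x)       = logpdf(dist, x)  # prior only
observe(::PriorContext, dist, x)      = 0.0

assume(::LikelihoodContext, dist, x)  = 0.0              # likelihood only
observe(::LikelihoodContext, dist, x) = logpdf(dist, x)
```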
So something like Stochastic HMC would just be:
```julia
using Turing, Random, DynamicPPL, AbstractMCMC

@model function demo(n=1, ::Type{TV}=Vector{Float64}) where {TV}
    m ~ Normal()
    x = TV(undef, n)
    for i in eachindex(x)
        x[i] ~ Normal(m, 1)
    end
end

rng = Random.default_rng()
data = randn(100)
batch_size = 10
model = demo(batch_size)
spl = DynamicPPL.Sampler(HMC(1e-3, 10))

# Initialize.
transition, state = let batch = data[1:batch_size]
    AbstractMCMC.step(rng, condition(model, x=batch), spl)
end
ts = AbstractMCMC.samples(transition, model, spl, 1)

# Iterate over all the batches.
for batch in Iterators.partition(shuffle(rng, data), batch_size)
    global transition, state  # needed when running this at the top level
    cmodel = condition(model, x=batch)
    transition, state = AbstractMCMC.step(rng, cmodel, spl, state)
    AbstractMCMC.save!!(ts, transition, length(ts) + 1, model, spl)
end
```
(Copy-paste from https://github.com/TuringLang/DynamicPPL.jl/issues/165#issuecomment-964307145)
Uh... yeah except you need the friction term for it to be actual Stochastic Gradient HMC ;)
Uh... yeah except you need the friction term for it to be actual Stochastic Gradient HMC ;)
I don't think the term "stochastic gradient HMC" is necessarily reserved for the version that includes a friction term :shrug: Even the paper discussing this refers to all these methods as "stochastic gradient HMC", but distinguishes them by calling the above the "naive" one and the one you're referring to the one "with friction".
And just for the record, I don't advocate for using either approach :sweat_smile: Neither is going to be pi-invariant in practice (pi being your target).
Ok, fair enough. But it still might cause some confusion!
It's also rather funny calling a sampling algorithm "stochastic"; of course this is meant as stochastic w.r.t. the data, but even plain HMC has stochasticity in the momentum proposal distribution.
Yup, I really don't like the "stochastic" naming; it's the same with VI where we already have stochasticity because we're using an empirical estimate of ELBO but here it also refers to data-subsampling.
There are several cases where one might want to subsample the data/observations used in a model between each MCMC iteration, e.g. stochastic HMC https://github.com/TuringLang/Turing.jl/pull/1502, stochastic ADVI.
As of right now, this would require re-instantiating the model between every call to `AbstractMCMC.step`. This seems to me to be a bit redundant? I honestly haven't tested this properly, so maybe it would even be possible to just re-instantiate the model between each call to `step` while re-using the `VarInfo`?
Otherwise we need some approach to do data subsampling without re-instantiating the model, very likely using a context (combined with the `MiniBatchContext` to appropriately re-scale the logpdf).
Desired properties: