TuringLang / DynamicPPL.jl

Implementation of domain-specific language (DSL) for dynamic probabilistic programming
https://turinglang.org/DynamicPPL.jl/
MIT License

Data subsampling without reinstantiating the model #208

Open torfjelde opened 3 years ago

torfjelde commented 3 years ago

There are several cases where one might want to subsample the data/observations used in a model between each MCMC iteration, e.g. stochastic HMC https://github.com/TuringLang/Turing.jl/pull/1502, stochastic ADVI.

As of right now, this would require re-instantiating the model between every call to AbstractMCMC.step. This seems to me to be a bit redundant. I honestly haven't tested this properly, so maybe it would even be possible to just re-instantiate the model between each call to step while re-using the VarInfo?

Otherwise we need some approach to do data subsampling without re-instantiating the model, very likely using a context (combined with the MiniBatchContext to appropriately re-scale the logpdf).

Desired properties:

  1. Ability to sub-select data in an arbitrary manner, e.g. the filtering mechanism could be random, it could be dynamically sized (number of observations used changes between iterations).
  2. Support lazy-loading of data, e.g. dataset is too large to fit into memory.
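
For reference, the MiniBatchContext mentioned above only covers the rescaling part. A minimal sketch of constructing it, assuming the MiniBatchContext(ctx; batch_size, npoints) constructor from the DynamicPPL versions around the time of this issue:

using DynamicPPL

# MiniBatchContext wraps another context and rescales the log-likelihood
# accumulated in observe calls by npoints / batch_size, so that a batch of
# 10 observations stands in for the full 100.
ctx = MiniBatchContext(DefaultContext(); batch_size = 10, npoints = 100)
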
torfjelde commented 3 years ago

would it be possible to make proper stochastic algorithms, i.e. do data-subsampling?

I thought about this and was surprised that it wasn't implemented, but I didn't include it here since it was not part of the old implementation and it requires some additional thought and discussion, which I assumed would be better to have in a separate PR or issue.

Mainly I thought about two approaches, both based on a special context and some special observe implementations to handle multiple observations in one ~ expression. One could try to extract the number of observations by running the fully sampled model once and then use the context to only sample one index (or a batch of indices) in every run and only include the corresponding observations. However, I assume that stochastic control flow would break this approach. As an alternative, one could instead specify the probability of an observation to be included (similar to dropout) and just decide randomly in each observe call if a (subset of) observation(s) will be included. This does not require a fixed number of observations and should work for all kinds of models, but there is a non-zero probability that all observations would be dropped - in this case it is not possible to normalize the (gradient of the) log probability with the number of considered observations.

From https://github.com/TuringLang/Turing.jl/pull/1502#issuecomment-751298197 by @devmotion
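
A minimal plain-Julia sketch of the second (dropout-style) idea from the quote, just to make the rescaling and the degenerate all-dropped case concrete; this is not DynamicPPL code, and all names are made up for illustration:

using Distributions, Random

# Each observation is included with probability p; the log-likelihood of the
# included subset is scaled by total / included so its magnitude is comparable
# to the full-data log-likelihood.
function dropout_loglikelihood(rng, dist, ys; p = 0.1)
    idx = findall(_ -> rand(rng) < p, ys)
    isempty(idx) && return 0.0  # degenerate case mentioned above; returning 0.0 is an arbitrary choice
    scale = length(ys) / length(idx)
    return scale * sum(logpdf(dist, ys[i]) for i in idx)
end

rng = MersenneTwister(1)
dropout_loglikelihood(rng, Normal(), randn(rng, 100); p = 0.2)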

torfjelde commented 3 years ago

One could try to extract the number of observations by running the fully sampled model once and then use the context to only sample one index (or a batch of indices) in every run and only include the corresponding observations. However, I assume that stochastic control flow would break this approach.

Hmm, yeah this could work pretty well, as you said, for static models. I was thinking of extracting the data containers and then just doing in-place mutations of the container. But just sampling the symbols directly is a better idea.

As an alternative, one could instead specify the probability of an observation to be included (similar to dropout) and just decide randomly in each observe call if a (subset of) observation(s) will be included

Surely this could fall under a more general "filter VarName" context which simply has a filtering function specified by the user? This filter could be random, deterministic, etc.

One issue with both approaches: neither handles the case where the full data set is too large to have in memory, and ideally you'd lazily load the batch. An alternative that might handle this is a "data-replacement" context which would essentially function as a merge, i.e. if a VarName is present in the context, use the value from the context, otherwise use the value "encoded" in the model. This would still require you to be able to hold 2 * batch_size in memory, but it's at least waaay better than the full dataset.

If combined with a filtering context as described above, we could also support dynamically sized data as long as the largest possible batch could be kept in memory.
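
A rough plain-Julia sketch of the "data-replacement as a merge" idea; this is not an actual DynamicPPL context, and all names are made up:

# Values present in the replacement NamedTuple win; anything not mentioned falls
# back to whatever the model was originally built with.
replace_data(model_data::NamedTuple, replacement::NamedTuple) = merge(model_data, replacement)

model_data = (x = zeros(1000), y = ones(1000))
batch      = (x = randn(10),)                 # e.g. a lazily loaded batch
length(replace_data(model_data, batch).x)     # 10: the batch replaces the full x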

devmotion commented 3 years ago

As of right now, this would require re-instantiating the model between every call to AbstractMCMC.step.

This is only possible if observations are provided as arguments to the model which is not guaranteed in general and not even the case for the standard models that we use in our tests.

devmotion commented 3 years ago

By the way, at this stage I think this should be addressed in Turing rather than in DynamicPPL by defining a custom context for SGHMC and SGLD, similar to the custom context for MLE and MAP optimization.

torfjelde commented 3 years ago

This is only possible if observations are provided as arguments to the model which is not guaranteed in general and not even the case for the standard models that we use in our tests.

That's a good point! But I also don't think that's too stringent a requirement for being able to use data-subsampling. It would be very natural to just have the user provide a method with a signature identical to the model's, which would do in-place mutations of the arguments.
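
Purely illustrative (no such hook exists in DynamicPPL); the idea is just that a user-supplied function refills the data buffer passed to the model between steps:

using Random

# FULL_DATA and the argument list are made up for illustration.
const FULL_DATA = randn(10_000)

function resample_batch!(x; rng = Random.default_rng())
    copyto!(x, rand(rng, FULL_DATA, length(x)))  # draw a fresh batch into the existing buffer
    return x
end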

torfjelde commented 3 years ago

By the way, at this stage I think this should be addressed in Turing rather than in DynamicPPL by defining a custom context for SGHMC and SGLD, similar to the custom context for MLE and MAP optimization.

Hmm, I guess it might be better to put it there. You mean "for now" while we are in a bit of an experimentation phase? Btw we already have a MiniBatchContext in DynamicPPL, which seems like a setup for having something like this.

devmotion commented 3 years ago

One issue with both approaches: neither handles the case where the full data set is too large to have in memory, and ideally you'd lazily load the batch. An alternative that might handle this is a "data-replacement" context which would essentially function as a merge, i.e. if a VarName is present in the context, use the value from the context, otherwise use the value "encoded" in the model. This would still require you to be able to hold 2 * batch_size in memory, but it's at least waaay better than the full dataset.

I think it would be easier to just use lazy datastructures for the observations (an array that loads data only if accessed via getindex) in such a case instead of handling this case in DynamicPPL or Turing.
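
A tiny sketch of what such a lazy data structure could look like in plain Julia; the load_observation helper is hypothetical, and any getindex-based loading (mmap, HDF5, a database) would do:

# A vector-like wrapper that only materialises an observation when it is indexed,
# e.g. one file per observation on disk.
struct LazyObservations{T} <: AbstractVector{T}
    paths::Vector{String}
end

Base.size(obs::LazyObservations) = (length(obs.paths),)
Base.getindex(obs::LazyObservations{T}, i::Int) where {T} = load_observation(T, obs.paths[i])  # hypothetical loader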

devmotion commented 3 years ago

Surely this could fall under a more general "filter VarName" context which simply has a filtering function specified by the user? This filter could be random, deterministic, etc.

In some sense, yes. But we also want to subsample hardcoded observations which don't have a VarName expression, and we want to subsample within VarNames, e.g. only include 1 out of 10 observations present in the variable y.

torfjelde commented 3 years ago

I think it would be easier to just use lazy datastructures for the observations (an array that loads data only if accessed via getindex) in such a case instead of handling this case in DynamicPPL or Turing.

That's a great idea! Though it seems like it might require a bit of work.

devmotion commented 3 years ago

Hmm, I guess it might be better to put it there. You mean "for now" while we are in a bit of an experimentation phase?

Yes, I think we should come up with a good design that works for SGHMC and SGLD and test it properly in Turing. If needed, it can be moved to DynamicPPL (or somewhere else) and maybe be generalized. It is definitely easier to iterate and experiment if we do not have to sync DynamicPPL and Turing initially.

Btw we already have a MiniBatchContext in DynamicPPL, which seems like a setup for having something like this.

IIRC this one is quite different, it just rescales the resulting log probabilities, doesn't it?

torfjelde commented 3 years ago

Yes, I think we should come up with a good design that works for SGHMC and SGLD and test it properly in Turing. If needed, it can be moved to DynamicPPL (or somewhere else) and maybe be generalized. It is definitely easier to iterate and experiment if we do not have to sync DynamicPPL and Turing initially.

Agree :+1:

IIRC this one is quite different, it just rescales the resulting log probabilities, doesn't it?

I just meant that the motivations behind MiniBatchContext and whatever we're doing have a big overlap :) But yes, it's of course much simpler.

torfjelde commented 3 years ago

In some sense, yes. But we also want to subsample hardcoded observations which don't have a VarName expression, and we want to subsample within VarNames, e.g. only include 1 out of 10 observations present in the variable y.

True, but I meant a function rather than a lookup, so that this could also be covered by the filtering context, where the filter method simply returns true with some probability.
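
Something as small as this would cover the random case (illustrative only):

# A filter that keeps a given VarName (or an index into it) with probability p.
random_filter(p) = _ -> rand() < p

keep = random_filter(0.1)
keep(:y)  # true roughly 10% of the time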

bgroenks96 commented 3 years ago

If I remember correctly, the way that I accomplished this in Tensorflow Probability was to use tf.Dataset in the log probability function. tf.Dataset is a general data structure that is able to lazy load sharded data on demand and apply preprocessing operations on the fly.

I'm not sure what the equivalent of this is in the Julia ecosystem (I'm still relatively new here), but I think this would be the ideal way to go; i.e. use a lazy/generic data structure in your @model function and then just take random batches in whatever manner is appropriate for the problem.

I highly recommend checking out the tf.Dataset API if you guys aren't familiar with it. Maybe Turing could provide a similar interface that would be able to handle multiple different kinds of backing stores (i.e. AbstractArray, DistributedArray, something from LazyArrays.jl, or just a plain old generator function).

bgroenks96 commented 3 years ago

My main point is that I think you should avoid building the actual subsampling procedure into the Turing sampler, model context, etc., as I think @devmotion suggested in his original two proposals (although perhaps I misunderstood). Different kinds of problems require different kinds of subsampling, so it's best to be as flexible as possible for the modeler and leave this up to them.

devmotion commented 3 years ago

We need some structure or algorithm and can't leave this only up to the user. In contrast to when you implement the log probability function yourself, the Turing @model does not provide a generic way to perform subsampling since there is no clear distinction between data/observations and latent variables/assumptions. Observations don't have to be arguments of the model, and even if you specify observations as arguments to the model, they are treated as latent variables if they are missing or e.g. a vector of missings.

Of course, it is possible for users to explicitly subsample observations in the model definition, but then the model can't be used with other samplers and can't be used for evaluating the actual log probabilities without stochastic subsampling.

Contexts provide a way to inject the subsampling procedure in the model execution only when desired, and regardless of how the observations and latent variables are specified.

That being said, I definitely agree that we should make as few assumptions as possible and let the modelers decide which (possibly lazy) datastructures they use for specifying the observations. This was the intention behind my comment above:

I think it would be easier to just use lazy datastructures for the observations (an array that loads data only if accessed via getindex) in such a case instead of handling this case in DynamicPPL or Turing.

However, this is orthogonal to the implementation of subsampling with contexts. I assume also that there are already implementations and packages for lazy data loading (e.g. JuliaDB or mmap with HDF5?), so ideally users should just be able to use them and I don't think Turing should focus on this part.

bgroenks96 commented 3 years ago

I'm afraid I can't comment further without a better understanding of Contexts. Is there documentation for this or do I need to just read the source code?

devmotion commented 3 years ago

There's some explanation here: https://turing.ml/dev/docs/for-developers/compiler

The main thing is just that one can provide a context when executing a model, and this context is forwarded to all observe and assume calls (i.e., the function calls to which ~ statements are rewritten). It is then possible to implement observe and assume (or some other functions they call) for a context to achieve a different behaviour when the model is executed. E.g. PriorContext only accumulates the log probability in assume calls, LikelihoodContext only in observe calls, and DefaultContext in both.
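
As a small concrete illustration of the last point, here is a sketch using the logprior / loglikelihood / logjoint helpers exported by DynamicPPL, which, as far as I know, are built on exactly these contexts:

using Turing, DynamicPPL

@model function coin(y)
    p ~ Beta(2, 2)
    for i in eachindex(y)
        y[i] ~ Bernoulli(p)   # observe calls
    end
end

m = coin([1, 0, 1, 1])
θ = (p = 0.6,)

logprior(m, θ)       # assume calls only (PriorContext-style behaviour)
loglikelihood(m, θ)  # observe calls only (LikelihoodContext-style behaviour)
logjoint(m, θ)       # both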

torfjelde commented 3 years ago

So something like Stochastic HMC would just be:

using Turing, Random, DynamicPPL
using AbstractMCMC

rng = Random.default_rng()

@model function demo(n=1, ::Type{TV} = Vector{Float64}) where {TV}
    m ~ Normal()
    x = TV(undef, n)
    for i in eachindex(x)
        x[i] ~ Normal(m, 1)
    end
end

data = randn(100)
batch_size = 10

model = demo(batch_size)

spl = DynamicPPL.Sampler(HMC(1e-3, 10))

# Initialize.
transition, state = let batch = data[1:batch_size]
    AbstractMCMC.step(rng, condition(model, x=batch), spl)
end

ts = AbstractMCMC.samples(transition, model, spl, 1)

# Iterate over all the batches.
for batch in Iterators.partition(shuffle(rng, data), batch_size)
    global transition, state  # needed when running this snippet as a script rather than in the REPL
    cmodel = condition(model, x=batch)
    transition, state = AbstractMCMC.step(rng, cmodel, spl, state)
    AbstractMCMC.save!!(ts, transition, length(ts) + 1, model, spl)
end

(Copy-paste from https://github.com/TuringLang/DynamicPPL.jl/issues/165#issuecomment-964307145)

bgroenks96 commented 3 years ago

Uh... yeah except you need the friction term for it to be actual Stochastic Gradient HMC ;)

torfjelde commented 3 years ago

Uh... yeah except you need the friction term for it to be actual Stochastic Gradient HMC ;)

I don't think the term "stochastic gradient HMC" is necessarily reserved for the version that includes a friction term :shrug: Even in the paper discussing this, they refer to all these methods as "stochastic gradient HMC", but distinguish them by calling the above the "naive" version and the one you're referring to the version "with friction".

And just for the record, I don't advocate for using either approach :sweat_smile: Neither is going to be pi-invariant in practice (pi being your target).

bgroenks96 commented 3 years ago

Ok, fair enough. But it still might cause some confusion!

It's also rather funny calling a sampling algorithm "stochastic"; of course this is meant as stochastic w.r.t. the data, but even plain HMC has stochasticity in the momentum proposal distribution.

torfjelde commented 3 years ago

Yup, I really don't like the "stochastic" naming; it's the same with VI, where we already have stochasticity because we're using an empirical estimate of the ELBO, but here it also refers to data-subsampling.