bayesiains / nflows

Normalizing flows in PyTorch
MIT License

Higher dimensional data leads to exploding likelihood #22

Closed cschoeller closed 4 years ago

cschoeller commented 4 years ago

Hey,

while playing around I stumbled upon some behavior that I can't quite explain to myself. I implemented the flow

from nflows import transforms, distributions, flows
from nflows.transforms.autoregressive import MaskedPiecewiseRationalQuadraticAutoregressiveTransform as AutoregRQS

modules = []
for i in range(n_layers):
    modules.extend([
        AutoregRQS(features=24, num_bins=10, hidden_features=8, tails='linear', tail_bound=5),
        transforms.LULinear(24),
    ])
modules.pop()  # drop the trailing LULinear so the flow ends on a spline transform
transform = transforms.CompositeTransform(modules)
base_distribution = distributions.StandardNormal(shape=[24])
flow = flows.Flow(transform=transform, distribution=base_distribution)

and used the negative log likelihood loss

def nll(batch, model):
    log_prob = -1 * model.log_prob(batch).mean(0)
    return log_prob

to train it on this dataset from planar_datasets.py:

training_dataset = FourCircles(50000).data
training_dataset = training_dataset.repeat(1, 12)

As you can see, I repeated the dataset's features on purpose to blow up the dimensionality of the problem. When I keep the dataset in its regular 2-D form, everything works fine. But in this version, after training for 2 epochs (Adam optimizer, lr 0.001, cosine annealing, batch size 64), my loss goes down to as low as -45, and it goes even lower if I train longer. That means the average likelihood must be around e^45, which should not happen.

The reason might be a conceptual misunderstanding on my side, as I also ran into this issue with a custom implementation of a Masked Autoregressive Flow. I hoped I had a bug and that implementing my model and the log_prob() computation with the nflows framework would eliminate the issue, but unfortunately it did not.

Any ideas how this can be explained?

Cheers :)

EDIT: I made some more observations that might help clarify this: in such a high-dimensional space, the density of the base distribution is embedded in a much larger volume, so the likelihoods of noise points become very small everywhere. This gives the Jacobian determinant a greater influence on the negative log likelihood and ultimately makes it blow up.

imurray commented 4 years ago

Consider a 2D density, where x₁ comes from a standard normal and x₂=x₁. What is the true density? It's zero everywhere, except on the line x₁=x₂. That line has zero area, so the true density on that line must be infinite for the density to integrate to one. The same will happen for any density confined to a lower-dimensional manifold than your full input space. While there is work on flows on manifolds, the standard flows in this package (so far) assume you are modelling a finite density.
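A minimal sketch of this in plain PyTorch (the shapes and jitter values are arbitrary illustrative choices): fit a maximum-likelihood, full-covariance Gaussian to 2-D data whose features have been duplicated, and the average log-likelihood grows without bound as the off-manifold jitter shrinks.

import torch

torch.manual_seed(0)

x = torch.randn(10_000, 2, dtype=torch.float64)   # 2-D data
data = torch.cat([x, x], dim=1)                   # duplicate features -> 4-D, but the data lies on a 2-D plane

for jitter in (1e-1, 1e-3, 1e-5):
    noisy = data + jitter * torch.randn_like(data)    # shrinking off-manifold noise
    gauss = torch.distributions.MultivariateNormal(noisy.mean(0), covariance_matrix=torch.cov(noisy.T))
    avg_ll = gauss.log_prob(noisy).mean().item()
    print(f"jitter={jitter:g}  average log-likelihood={avg_ll:.1f}")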

cschoeller commented 4 years ago

Thanks for your answer. But I'm confused: my base density and output distribution are both 24-dimensional.

imurray commented 4 years ago

In my example the dataset is 2D, but all the data happens to be on a 1D line. In your example the dataset is 24D, but all the data happens to be on a 2D manifold. The true density is zero "almost everywhere". Any point that doesn't look like it's a 2D point that's been duplicated 12 times (which is almost every point in the 24D space) has zero density under your true generative process.

It's really worth understanding these issues, because otherwise comparisons between models are just driven by which model can exploit artifacts the quickest. For example, if you make D-dimensional vectors all zero mean, the data now lies on a (D-1)-dimensional manifold and the true density is again infinite on the data (any vector that doesn't add up to zero, which is almost all of them, has zero density).

cschoeller commented 4 years ago

Ok, I see what you mean. I wasn't aware of this 'issue'. In fact, I built this toy example for testing purposes after I hit the same problem with a different dataset in an applied domain. It's very likely that centering issues like the ones you describe are the reason for the problem with the other dataset as well. I will look into this again, maybe I'll find a solution. Thank you!

imurray commented 4 years ago

If you have zero-mean data vectors, an easy fix is to discard one of the features. You can perfectly reconstruct any feature from all of the others, so you don't need to model it.
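A sketch of that fix (assuming the data is an (N, D) tensor whose rows each sum to zero; the shapes are illustrative):

import torch

data = torch.randn(1000, 24)
data = data - data.mean(dim=1, keepdim=True)   # zero-mean rows: the data now lies on a 23-D subspace
reduced = data[:, :-1]                         # model only the first D-1 features with the flow
recovered = -reduced.sum(dim=1, keepdim=True)  # the discarded feature is reconstructed exactly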

If you have quantized data (e.g., discrete pixel values in 0..255), you can add noise to it, or read the more recent literature on dequantization.
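A sketch of the simplest version of this, uniform dequantization (assuming integer values in 0..255; the batch shape is made up):

import torch

pixels = torch.randint(0, 256, (64, 24))                      # stand-in for a batch of quantized data
noisy = pixels.to(torch.float32) + torch.rand(pixels.shape)   # add U[0,1) noise to each value
dequantized = noisy / 256.0                                   # continuous values in [0, 1)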

If your data contains "atoms", special values that correspond to spikes of infinite density, you'll need to remove these, or model them outside of a flow.

arturbekasov commented 4 years ago

If it turns out that your data lies on a non-trivial lower-dimensional manifold, both [1] and [2] propose methods for learning the lower-dimensional manifold and the density on it simultaneously. The code for [1] is based on nflows and is readily available.

[1]: Flows for simultaneous manifold learning and density estimation, by Brehmer & Cranmer
[2]: Regularized Autoencoders via Relaxed Injective Probability Flow, by Kumar et al.

cschoeller commented 4 years ago

Thank you both again so much! Your answers saved me a lot of time and frustration :).

It turns out my problem was very similar to your zero-mean example: I work with a type of time series. To illustrate, let's take as an example a scalar series [x_1, x_2, ..., x_n]. Interpreted as an n-dimensional data point / vector, this is what I train the flow on. To make my data more 'homogeneous', I subtracted the value x_{-1} from each x_i. I believe this had a similar effect to zero-centering, because x_{-1} can potentially be very close to x_1 (does that make sense?). I will now just try to scale the dataset as a whole by a constant instead.

imurray commented 4 years ago

It doesn't make sense to me, no. If sometimes one time step is an exact copy of the previous one, you have infinite densities with or without taking differences. If your data is quantized (and you don't do anything about it), exact duplicates in the time series are especially likely to be a problem.

If each time step is just very close, then the pre-processing that you've been doing (modelling the differences) is likely to make things numerically better though. You're removing really strong covariances.

Obvious linear dependencies (for example, from making each time-series zero mean) would be caught by fitting a multivariate Gaussian baseline. That is, check that the covariance of your data is full rank. But even if your covariance is full rank, you can have non-linear deterministic dependencies that mess you up. Maybe start with an autoregressive model and look at the predictions for each element of your time-series given the previous ones.
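A minimal version of that rank check (a sketch in PyTorch, using the duplicated toy data from earlier in the thread as a stand-in for real data):

import torch

data = torch.randn(50_000, 2).repeat(1, 12)                 # e.g. the 2-D data duplicated to 24-D
cov = torch.cov(data.T)
rank = torch.linalg.matrix_rank(cov)
print(f"covariance rank: {rank.item()} / {cov.shape[0]}")   # 2 / 24 here, so the density is degenerate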

cschoeller commented 4 years ago

Yes, you are right. I investigated further and it often (perhaps too often) happens that adjacent timesteps in a series have the same values, i.e. the series is temporarily plateauing.

cschoeller commented 4 years ago

I suppose this issue is not the right place for such discussions. But if you don't mind, I have one more practical question:

In the time series I described, each sequence has n steps. But sometimes my data is incomplete, i.e. some tail steps n-m to n are missing. So far I discard these partial samples during training, as I really want to learn the distribution over all timesteps. Another way to handle this would be to impute the missing timesteps, but the imputation would always contain errors that the model would pick up.

Ideally, I would like to mask out the missing timesteps so I can use incomplete samples for training anyway. For example, for the affected samples I could set the Jacobian diagonal for the missing dimensions to 1 and use a lower-dimensional Gaussian to compute the noise likelihood. My model would then be trained on an (n-m)-dimensional subspace for some samples. But would it still learn a proper density over all n dimensions?

imurray commented 4 years ago

I'm bowing out of supervising your research here, sorry. I'll repeat my suggestion to consider an autoregressive model though: easy to marginalize out the end of a time series by just not predicting it.

cschoeller commented 4 years ago

I understand, thanks anyways!