johnryan465 opened 2 years ago
The wording seems right, but it's possible it should be the number of batches / epochs instead. Alternatively: it looks like we are only using one sample to estimate the ELBO, which might affect the accuracy of the loss.
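For reference, this is roughly what a multi-sample Monte-Carlo ELBO estimate looks like. This is a hedged sketch, not the repo's API; `log_joint`, `q_sample`, and `q_log_prob` are illustrative callables standing in for whatever the model actually exposes:

```python
import torch

def elbo_estimate(log_joint, q_sample, q_log_prob, n_samples=10):
    """Monte-Carlo ELBO: average log p(x, z) - log q(z) over n_samples draws.

    Illustrative sketch only: log_joint, q_sample, q_log_prob are placeholder
    callables, not functions from this repo.
    """
    vals = []
    for _ in range(n_samples):
        z = q_sample()                      # z ~ q(z)
        vals.append(log_joint(z) - q_log_prob(z))
    return torch.stack(vals).mean()
```

With `n_samples=1` the estimate is unbiased but high-variance, which could be part of why the loss looks noisy.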
It does seem like the shapes are a bit odd. If in the training loop we change `torch.tensor([100, 1])` to `torch.ones([100, 1])`, as in `sample` and in the notebook, the code breaks for all layer types except planar flow. Maybe it's to do with how we implement `inverse` and/or the ELBO?
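The two calls above are easy to conflate but produce very different tensors, which would explain shape-dependent breakage downstream:

```python
import torch

# torch.tensor([100, 1]) treats [100, 1] as *data*: a 1-D tensor with two entries.
a = torch.tensor([100, 1])   # shape (2,), values [100, 1]

# torch.ones([100, 1]) treats [100, 1] as a *shape*: a 100x1 tensor of ones.
b = torch.ones([100, 1])     # shape (100, 1), all values 1.0

print(a.shape, b.shape)      # torch.Size([2]) torch.Size([100, 1])
```

So any code that worked by accident with the 1-D `(2,)` tensor will see a completely different batch shape once the `(100, 1)` tensor is passed in.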
I'm comparing the notebook from the latest version of the code against a past version that still works (https://github.com/ATML-Group-12/normalising_flows/tree/da1165cf9a1e4debf0a1edf0ef2e422c089f90e4). I've run visual.ipynb for both, and the past version gives a result close to the paper, but the current version is ... messy at best.
I've dug a bit and here are some things that changed and might impact the performance:

- `Basic.forward` now builds the distribution per input `x` (repeating `self.mean` and `self.cov` across the batch), as opposed to just returning a `MultivariateNormal(self.mean, self.cov)`. (Undoing this makes the final distribution clearer in the last cell.)

This seems to remain an issue after the above adjustments. The radial flow with flow length 2 seems to be performing the best, but its performance gets worse as the flow length increases. Planar flow, NICE orthogonal, and NICE permutation basically don't learn much. Moreover, this always comes up:
```
  File "/Users/lyndonf/Desktop/normalising_flows/nice/model/model.py", line 24, in forward
    dist = self.embedding(x)
  File "/Users/lyndonf/anaconda3/envs/flows/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/lyndonf/Desktop/normalising_flows/flows/embedding/basic.py", line 31, in forward
    return MultivariateNormal(self.mean.unsqueeze(0).repeat(K, 1), self.cov.unsqueeze(0).repeat(K, 1, 1))
  File "/Users/lyndonf/anaconda3/envs/flows/lib/python3.9/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/Users/lyndonf/anaconda3/envs/flows/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (1, 2)) of distribution MultivariateNormal(loc: torch.Size([1, 2]), covariance_matrix: torch.Size([1, 2, 2])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan]], grad_fn=<ExpandBackward0>)
```
whether in training or sampling. This appears to be the case before #18 as well.
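That `ValueError` is `torch.distributions` argument validation rejecting a NaN `loc`, i.e. the mean parameter itself has gone NaN before the distribution is even built. A minimal reproduction of the error, plus a hedged pointer for tracking down where the NaN first appears:

```python
import torch
from torch.distributions import MultivariateNormal

# Reproduce the validation error: a NaN loc fails IndependentConstraint(Real(), 1).
nan_loc = torch.full((1, 2), float("nan"))
cov = torch.eye(2).unsqueeze(0)          # shape (1, 2, 2), a valid covariance
try:
    MultivariateNormal(nan_loc, cov, validate_args=True)
except ValueError as e:
    print("rejected:", e)

# A possible debugging aid (assumption: the NaN originates in the backward pass):
# torch.autograd.set_detect_anomaly(True) makes autograd raise at the op that
# produced the NaN gradient, at the cost of slower training.
```

So the fix is unlikely to be in `Basic.forward` itself; the parameters feeding it are already NaN, e.g. from a diverging optimizer step or a non-finite log-det term earlier.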
The lack of improvement for larger models is why I was curious about whether the number of training iterations should be impacted by the number of parameters, since we are effectively training the larger models less.
NICE doesn't learn properly?
The flows are working okay for me; there is an issue of inputting a tensor of `[1, 1]` instead of `[[1]]` in a few places.
@LyndonFan what was the purpose of the Basic embedding's much more complex covariance structure? I'm not sure I follow the purpose of it. Having `torch.no_grad` in the forward pass is generally not a good idea.
It was because when training NICE permutation, the covariance sometimes stopped being positive definite, leading to an error. Maybe there is a way to enforce the structure of `L_below` without using `torch.no_grad`?
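One standard way to guarantee positive definiteness without `torch.no_grad` is to learn an unconstrained matrix and map it to a valid Cholesky factor, passing `scale_tril` to the distribution directly. This is a hedged sketch under that assumption; the class name and fields below are illustrative, not the repo's `Basic` embedding:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CholeskyEmbedding(nn.Module):
    """Illustrative sketch: covariance positive definite by construction."""

    def __init__(self, dim):
        super().__init__()
        self.mean = nn.Parameter(torch.zeros(dim))
        self.raw_tril = nn.Parameter(torch.zeros(dim, dim))  # unconstrained

    def scale_tril(self):
        # Strictly lower-triangular part is free; the diagonal is forced
        # positive via softplus, so L @ L.T is always positive definite.
        L = torch.tril(self.raw_tril, diagonal=-1)
        diag = F.softplus(torch.diagonal(self.raw_tril)) + 1e-6
        return L + torch.diag(diag)

    def forward(self):
        # Passing scale_tril also skips forming and validating the full
        # covariance matrix, which is cheaper and numerically safer.
        return torch.distributions.MultivariateNormal(
            self.mean, scale_tril=self.scale_tril()
        )
```

Since the constraint is built into the parametrization, gradients flow through normally and no in-place `no_grad` surgery on the tensor is needed.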
Ah so it was for NICE gotcha.
@LyndonFan I think 500k parameter updates must mean updating each parameter 500k times. Fewer epochs to converge with larger models doesn't make sense to me.
Additionally, for most (if not all) models the value we are optimising will not actually be the ELBO, due to the annealed version not hitting 10k iterations.
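For context, the annealed objective weights the bound by a factor that ramps up to 1 over the first ~10k iterations. A hedged sketch of such a schedule (the exact `floor` and `warmup` values here are illustrative, not necessarily what the repo uses):

```python
def anneal_beta(step, warmup=10_000, floor=0.01):
    """Linearly ramp the annealing weight from `floor` to 1 over `warmup` steps.

    Illustrative values: floor and warmup are assumptions, not the repo's config.
    """
    return min(1.0, floor + step / warmup)
```

If training stops well before `warmup` steps, `anneal_beta` never reaches 1, so the quantity being optimised is the annealed bound rather than the true ELBO, which is the concern raised above.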
Sounds fair, plus our current methods don't quite work. I think I tried 5k iterations (though that was 500000 / 100 for the minibatch size), but it still didn't work. I'll set it to 500k then.
Which methods? So for the flows there is an input shape problem which I think I've rectified but am testing atm.
I'm watching Planar with k=8 and it seems to be training okay but needs much more than 10k iterations.
By "current methods" I just meant what we've fixed / tried before; I wasn't referring to a particular energy function / layer type.
The observation I am making of the training is that quite quickly it latches on to one of the modes and then doesn't split.
Actually the losses agree with the claimed performance in the graphs, just not the density plots.
It is unclear to me what they mean by this plot, because the way I'm interpreting it, our model is actually much better for the first density function somehow.
Our planar flow with k=8 hits a -1 variational bound easily. The current run hits -1.5.
The "number of parameter updates" in the paper is quite ambiguous, and under the interpretation we currently have, our models are not performing anywhere near as well. Any thoughts on it?