johnryan465 opened 2 years ago
The wording seems right, but it's possible it should be the number of batches / epochs instead. Alternatively: it looks like we are only using one sample to estimate the ELBO, which might affect the accuracy of the loss.
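For reference, this is roughly what a multi-sample Monte-Carlo ELBO estimate looks like. This is a hedged sketch, not the repo's API; `log_joint`, `q_sample`, and `q_log_prob` are illustrative callables standing in for whatever the model actually exposes:

```python
import torch

def elbo_estimate(log_joint, q_sample, q_log_prob, n_samples=10):
    """Monte-Carlo ELBO: average log p(x, z) - log q(z) over n_samples draws.

    Illustrative sketch only: log_joint, q_sample, q_log_prob are placeholder
    callables, not functions from this repo.
    """
    vals = []
    for _ in range(n_samples):
        z = q_sample()                      # z ~ q(z)
        vals.append(log_joint(z) - q_log_prob(z))
    return torch.stack(vals).mean()
```

With `n_samples=1` the estimate is unbiased but high-variance, which could be part of why the loss looks noisy.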
It does seem like the shapes are a bit odd. If in the training loop we change `torch.tensor([100, 1])` to `torch.ones([100, 1])`, as in `sample` and in the notebook, the code breaks for all layer types except planar flow. Maybe it's to do with how we implement `inverse` and/or the ELBO?
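The two calls above are easy to conflate but produce very different tensors, which would explain shape-dependent breakage downstream:

```python
import torch

# torch.tensor([100, 1]) treats [100, 1] as *data*: a 1-D tensor with two entries.
a = torch.tensor([100, 1])   # shape (2,), values [100, 1]

# torch.ones([100, 1]) treats [100, 1] as a *shape*: a 100x1 tensor of ones.
b = torch.ones([100, 1])     # shape (100, 1), all values 1.0

print(a.shape, b.shape)      # torch.Size([2]) torch.Size([100, 1])
```

So any code that worked by accident with the 1-D `(2,)` tensor will see a completely different batch shape once the `(100, 1)` tensor is passed in.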
I'm comparing the notebook from the latest version of the code against a past version that still works (https://github.com/ATML-Group-12/normalising_flows/tree/da1165cf9a1e4debf0a1edf0ef2e422c089f90e4). I've run visual.ipynb for both, and the past version gives a result close to the paper, but the current version is ... messy at best.
I've dug a bit and here are some things that changed and might impact the performance:

- `Basic.forward` now builds the distribution per input `x` (repeating `self.mean` and `self.cov` across the batch), as opposed to just returning a `MultivariateNormal(self.mean, self.cov)`. (Undoing this makes the final distribution clearer in the last cell.)

This seems to remain an issue after the above adjustments. The radial flow with flow length 2 seems to be performing the best, but its performance gets worse as the flow length increases. Planar flow, NICE orthogonal, and NICE permutation basically don't learn much. Moreover, this always comes up:
```
  File "/Users/lyndonf/Desktop/normalising_flows/nice/model/model.py", line 24, in forward
    dist = self.embedding(x)
  File "/Users/lyndonf/anaconda3/envs/flows/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/lyndonf/Desktop/normalising_flows/flows/embedding/basic.py", line 31, in forward
    return MultivariateNormal(self.mean.unsqueeze(0).repeat(K, 1), self.cov.unsqueeze(0).repeat(K, 1, 1))
  File "/Users/lyndonf/anaconda3/envs/flows/lib/python3.9/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/Users/lyndonf/anaconda3/envs/flows/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (1, 2)) of distribution MultivariateNormal(loc: torch.Size([1, 2]), covariance_matrix: torch.Size([1, 2, 2])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan]], grad_fn=<ExpandBackward0>)
```
whether in training or sampling. This appears to be the case before #18 as well.
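That `ValueError` is `torch.distributions` argument validation rejecting a NaN `loc`, i.e. the mean parameter itself has gone NaN before the distribution is even built. A minimal reproduction of the error, plus a hedged pointer for tracking down where the NaN first appears:

```python
import torch
from torch.distributions import MultivariateNormal

# Reproduce the validation error: a NaN loc fails IndependentConstraint(Real(), 1).
nan_loc = torch.full((1, 2), float("nan"))
cov = torch.eye(2).unsqueeze(0)          # shape (1, 2, 2), a valid covariance
try:
    MultivariateNormal(nan_loc, cov, validate_args=True)
except ValueError as e:
    print("rejected:", e)

# A possible debugging aid (assumption: the NaN originates in the backward pass):
# torch.autograd.set_detect_anomaly(True) makes autograd raise at the op that
# produced the NaN gradient, at the cost of slower training.
```

So the fix is unlikely to be in `Basic.forward` itself; the parameters feeding it are already NaN, e.g. from a diverging optimizer step or a non-finite log-det term earlier.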
The lack of improvement for larger models is why I was curious about whether the number of training iterations should be impacted by the number of parameters, since we are effectively training the larger models less.
NICE doesn't learn properly?
The flows are working okay for me; there is an issue of inputting a tensor of `[1, 1]` instead of `[[1]]` in a few places.
@LyndonFan what was the purpose of the Basic embedding's much more complex covariance structure? I'm not sure I follow the purpose of it. Having `torch.no_grad` in the forward pass is generally not a good idea.
It was because when training NICE permutation, the covariance sometimes stopped being positive definite, leading to an error. Maybe there is a way to enforce the structure of `L_below` without using `torch.no_grad`?
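One standard way to guarantee positive definiteness without `torch.no_grad` is to learn an unconstrained matrix and map it to a valid Cholesky factor, passing `scale_tril` to the distribution directly. This is a hedged sketch under that assumption; the class name and fields below are illustrative, not the repo's `Basic` embedding:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CholeskyEmbedding(nn.Module):
    """Illustrative sketch: covariance positive definite by construction."""

    def __init__(self, dim):
        super().__init__()
        self.mean = nn.Parameter(torch.zeros(dim))
        self.raw_tril = nn.Parameter(torch.zeros(dim, dim))  # unconstrained

    def scale_tril(self):
        # Strictly lower-triangular part is free; the diagonal is forced
        # positive via softplus, so L @ L.T is always positive definite.
        L = torch.tril(self.raw_tril, diagonal=-1)
        diag = F.softplus(torch.diagonal(self.raw_tril)) + 1e-6
        return L + torch.diag(diag)

    def forward(self):
        # Passing scale_tril also skips forming and validating the full
        # covariance matrix, which is cheaper and numerically safer.
        return torch.distributions.MultivariateNormal(
            self.mean, scale_tril=self.scale_tril()
        )
```

Since the constraint is built into the parametrization, gradients flow through normally and no in-place `no_grad` surgery on the tensor is needed.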
Ah so it was for NICE gotcha.
@LyndonFan I think 500k parameter updates must mean updating each parameter 500k times. Fewer epochs to converge with larger models doesn't make sense to me.
Additionally, for most (if not all) models the value we are optimising will not actually be the ELBO, due to the annealed version not hitting 10k iterations.
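For context, the annealed objective weights the bound by a factor that ramps up to 1 over the first ~10k iterations. A hedged sketch of such a schedule (the exact `floor` and `warmup` values here are illustrative, not necessarily what the repo uses):

```python
def anneal_beta(step, warmup=10_000, floor=0.01):
    """Linearly ramp the annealing weight from `floor` to 1 over `warmup` steps.

    Illustrative values: floor and warmup are assumptions, not the repo's config.
    """
    return min(1.0, floor + step / warmup)
```

If training stops well before `warmup` steps, `anneal_beta` never reaches 1, so the quantity being optimised is the annealed bound rather than the true ELBO, which is the concern raised above.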
Sounds fair, plus our current methods don't quite work. I think I tried 5k iterations (though that was 500000 / 100 for the minibatch size), but it still didn't work. I'll set it to 500k then.
Which methods? So for the flows there is an input shape problem which I think I've rectified but am testing atm.
I'm watching Planar with k=8 and it seems to be training okay but needs much more than 10k iterations.
By "current methods" I just meant what we've fixed / tried before; I wasn't referring to a particular energy function / layer type.
The observation I am making of the training is that quite quickly it latches on to one of the modes and then doesn't split.
Actually the losses agree with the claimed performance in the graphs, just not the density plots.
It is unclear to me what they mean by this plot, because the way I'm interpreting it, our model is actually much better for the first density function somehow.
Our planar flow with k=8 hits a -1 variational bound easily. The current run hits -1.5.
The "number of parameter updates" in the paper is quite ambiguous, and under the interpretation we currently have, our models are not performing anywhere near as well. Any thoughts on it?