Thanks @MikeLasz. I do understand your motivation, and that it's valuable to pin down where differences come from. Just as a warning though, it's unlikely anyone will have time to help you with this issue. I have my hands full with teaching and other commitments, and the student authors of these papers who wrote the code have moved on to full time jobs. It's not a core aim of this repository to reproduce other papers. The MAF and Neural Spline Flows papers both have frozen repositories with the code used to get their results.
By the way, I'm pretty sure the MAF paper is reporting standard errors for single fitted models (I wish it had said that rather than standard deviations, I'm sorry), so I don't think the error bars are capturing the expected variation you'd see on refitting. But the code is there if you can be bothered to get it running, which could be a bit of a fight. It's a Theano code base, so it will require getting an old environment running, and it's likely some details (e.g. defaults for initialization of layers) are different from the nflows code.
Thank you for your kind (and very quick) response @imurray! If no further ideas come up (perhaps a parameter that is set to a different default value, or something similar), I think we can close this issue soon.
Just wrapping my head around this post, @MikeLasz. You say your experiments result in a log-likelihood value of around 17.9, or -17.9, compared to the test log-likelihood of about -17.70 reported in the paper?
It is actually a bit worse than -17.9. Running the experiment 3 times, I got test log-likelihoods of -17.943886, -18.005745, and -17.997137. I know it is hard to call results significantly different from only three runs, but taking the small standard deviations/errors into account, the gap might well be statistically significant; see the quick check below. I wonder whether this is indeed due to details like the initialization or due to mistakes in my implementation.
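A quick check using only those three values and the paper's reported -17.70:

```python
import numpy as np

# Test log-likelihoods from the three nflows runs above.
runs = np.array([-17.943886, -18.005745, -17.997137])

mean = runs.mean()                  # about -17.98
std = runs.std(ddof=1)              # sample standard deviation, about 0.03
stderr = std / np.sqrt(len(runs))   # standard error of the mean, about 0.02

print(f"mean = {mean:.3f}, std = {std:.3f}, stderr = {stderr:.3f}")
# The gap to the paper's -17.70 is roughly 0.28, i.e. many standard errors,
# so the difference does not look like run-to-run noise alone.
```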
Hi! I am trying to reproduce some benchmark results, such as those from the original MAF paper (https://arxiv.org/pdf/1705.07057.pdf). For instance, MAF with 5 layers on HEPMASS achieves a test log-likelihood of about -17.70 with a reported standard deviation of essentially 0. When recomputing the experiments using this library, I get significantly different results (taking that tiny standard deviation into account). Let me summarize some hyperparameters from the original paper that allow us to rebuild the employed MAF: I) Model architecture:
II) Optimization:
Hence, the essential ingredients for retraining a MAF are the flow instance:
MaskedAutoregressiveFlow(features=D, hidden_features=512, num_layers=5, num_blocks_per_layer=1, use_residual_blocks=False, batch_norm_between_layers=True)
(Note that this implementation of MAF employs a Batch-Norm layer between the autoregressive flow layers. However, using Batch-Norm only after every 2 autoregressive layers, I get similar results; see the sketch below.) and the optimizer: optim.Adam(flow.parameters(), lr=1e-4, weight_decay=1e-6). Data is obtained and preprocessed according to the original implementation, https://github.com/gpapamak/maf.
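For that every-2-layers variant, the model can be assembled by hand from nflows transforms; a sketch of one way to do it (my actual construction may differ in small details):

```python
from nflows.flows.base import Flow
from nflows.distributions.normal import StandardNormal
from nflows.transforms.base import CompositeTransform
from nflows.transforms.permutations import ReversePermutation
from nflows.transforms.autoregressive import MaskedAffineAutoregressiveTransform
from nflows.transforms.normalization import BatchNorm

D = 21  # HEPMASS dimensionality after the original preprocessing

transforms = []
for i in range(5):  # 5 autoregressive layers, as in MAF(5)
    transforms.append(ReversePermutation(features=D))
    transforms.append(
        MaskedAffineAutoregressiveTransform(
            features=D,
            hidden_features=512,
            num_blocks=1,
            use_residual_blocks=False,
        )
    )
    if i % 2 == 1:  # Batch-Norm only after every second autoregressive layer
        transforms.append(BatchNorm(features=D))

flow = Flow(
    transform=CompositeTransform(transforms),
    distribution=StandardNormal(shape=[D]),
)
```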
I am wondering whether my differing results are due to the implementation in nflows or because I missed some architectural detail. Besides helping me to reproduce the results, I think it would be beneficial to extend the example section of this repository with some benchmark computations like these. I am willing to help you with this task.
In the following, I present a minimal example that reproduces my results. Running this 3 times, I get final test losses of 17.943886, 18.005745, and 17.997137.
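In outline, the setup is the sketch below (the random stand-in data, batch size, and number of epochs are placeholders; the actual runs use the preprocessed HEPMASS splits described above):

```python
import torch
from torch import optim
from torch.utils.data import DataLoader, TensorDataset
from nflows.flows import MaskedAutoregressiveFlow

# Random stand-in data with the HEPMASS dimensionality (D = 21); in the actual
# experiment these are the preprocessed train/test splits from the maf repository.
D = 21
train_x = torch.randn(10_000, D)
test_x = torch.randn(2_000, D)

flow = MaskedAutoregressiveFlow(
    features=D, hidden_features=512, num_layers=5, num_blocks_per_layer=1,
    use_residual_blocks=False, batch_norm_between_layers=True,
)
optimizer = optim.Adam(flow.parameters(), lr=1e-4, weight_decay=1e-6)

# Placeholder training schedule, not the exact values behind the numbers above.
batch_size = 256
num_epochs = 10
loader = DataLoader(TensorDataset(train_x), batch_size=batch_size, shuffle=True)

for epoch in range(num_epochs):
    flow.train()
    for (batch,) in loader:
        optimizer.zero_grad()
        loss = -flow.log_prob(batch).mean()  # loss = average negative log-likelihood
        loss.backward()
        optimizer.step()

flow.eval()
with torch.no_grad():
    test_loss = -flow.log_prob(test_x).mean()  # "final test loss" = test NLL
print(f"final test loss: {test_loss.item():.6f}")
```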
Thank you very much!