YannDubs / Neural-Process-Family

Code for the Neural Process Family website and replication of 4 papers on NPs. PyTorch implementation.
https://yanndubs.github.io/Neural-Process-Family/
MIT License

Why use posterior sampling for evaluation? #5

Closed xuesongwang closed 3 years ago

xuesongwang commented 3 years ago

Dear Yann Dubois, first of all, thanks for this amazing project! The NPF general framework is beautifully designed. However, I have one question regarding model evaluation.

For latent-based methods, you mention on the website that "when evaluating we will evaluate the log likelihood using posterior sampling". Based on the code below:

https://github.com/YannDubs/Neural-Process-Family/blob/892d0439614804ee671d66464fcb7d46ab43629b/npf/neuralproc/base.py#L500-L505

when is_q_zCct=True and Y_target is given, the posterior distribution is used for inference.
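To make sure I understand, here is a minimal, runnable sketch of how I read the selection logic (the function and names are illustrative, not the repo's actual code):

```python
import torch
from torch.distributions import Normal

def choose_sampling_dist(q_zCc, q_zCct=None, Y_trgt=None):
    """Illustrative simplification of the selection in base.py:
    sample z from the posterior q(z|C,T) only when the ground-truth
    targets are available, otherwise fall back to the prior q(z|C)."""
    if q_zCct is not None and Y_trgt is not None:
        return q_zCct  # posterior sampling (training and, currently, evaluation)
    return q_zCc       # prior sampling (true test-time conditions)

# Toy usage with a 16-dim latent and a batch of 1:
q_zCc = Normal(torch.zeros(1, 16), torch.ones(1, 16))        # from (X_cntxt, Y_cntxt)
q_zCct = Normal(0.1 * torch.ones(1, 16), torch.ones(1, 16))  # from (X_trgt, Y_trgt)
z = choose_sampling_dist(q_zCc, q_zCct, Y_trgt=torch.randn(1, 10, 1)).rsample()
```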

But why not mask Y_target during evaluation and save the model based on results sampled from the prior distribution (R from the context X, Y)? Would it be possible for a model to learn a good decoder(posterior_z_samples, X_target) while the divergence D(post_z || prior_z) is large? In that case, performance on the test set would be horrible.

Any insight will be appreciated. Thanks.

YannDubs commented 3 years ago

Hi @xuesongwang, thanks for the kind words, but I don't completely understand what you are saying. Are you essentially suggesting masking Y_trgt but still using X_trgt during evaluation? If so, I don't really see how that can help; the decoder already has access to X_trgt...

xuesongwang commented 3 years ago

> Hi @xuesongwang, thanks for the kind words, but I don't completely understand what you are saying. Are you essentially suggesting masking Y_trgt but still using X_trgt during evaluation? If so, I don't really see how that can help; the decoder already has access to X_trgt...

Thanks @YannDubs for pointing that out. What I was trying to say is that during model evaluation, the encoder already has access to both X_trgt and Y_trgt via R_from_trgt = self.encode_globally(X_trgt, Y_trgt), and then generates the z_sample (posterior distribution) used for decoding afterwards. However, during testing this z_sample is obtained from the prior distribution, R = self.encode_globally(X_cntxt, Y_cntxt), due to the lack of Y_trgt, i.e., the ground truth.

If there is a distributional gap D(z_sample_post || z_sample_prior), then R and R_from_trgt will differ, resulting in an underperforming model on the test set. Hence, why not use the prior distribution when saving the model instead? To achieve this, Y_trgt can be set to None during evaluation so that sampling_dist = q_zCc is used.
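Concretely, the change I have in mind would look something like this (a hypothetical sketch; I'm assuming an already-built model with the forward(X_cntxt, Y_cntxt, X_trgt, Y_trgt) signature, and tensors defined elsewhere):

```python
import torch

# Hypothetical evaluation fragment; `model` and the tensors are assumed.
model.eval()
with torch.no_grad():
    # Current behaviour: Y_trgt is passed in, so z ~ q(z|C,T) (posterior).
    out_posterior = model(X_cntxt, Y_cntxt, X_trgt, Y_trgt)

    # Proposed behaviour: mask Y_trgt so that sampling_dist = q_zCc,
    # i.e. z ~ q(z|C), matching what actually happens at test time.
    out_prior = model(X_cntxt, Y_cntxt, X_trgt, Y_trgt=None)
```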

YannDubs commented 3 years ago

But the KL divergence in the training objective essentially ensures that the distributional gap is small. It's the same reason you use the posterior during training of a VAE (i.e., condition on the image) but only generate from the prior when evaluating a VAE.
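For concreteness, here is a sketch of the latent-NP variational objective (names are illustrative, not the repo's code): the KL term directly penalises a gap between the posterior q(z|C,T) and the prior q(z|C), so sampling from the prior at test time stays close to what the decoder saw during training.

```python
import torch
from torch.distributions import Normal, kl_divergence

def npvi_loss(log_p_yCz, q_zCct, q_zCc):
    """Sketch of the NPVI-style objective (cf. Garnelo et al., 2018):
    E_{z ~ q(z|C,T)}[log p(Y_trgt | z, X_trgt)] - KL(q(z|C,T) || q(z|C)).
    log_p_yCz: decoder log-likelihood of the targets, shape [batch]."""
    kl = kl_divergence(q_zCct, q_zCc).sum(-1)  # sum over latent dims
    return -(log_p_yCz - kl).mean()

# Toy numbers with a 16-dim latent and a batch of 2:
q_zCc = Normal(torch.zeros(2, 16), torch.ones(2, 16))
q_zCct = Normal(0.05 * torch.ones(2, 16), 0.9 * torch.ones(2, 16))
loss = npvi_loss(torch.tensor([-3.2, -2.8]), q_zCct, q_zCc)
```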

Check out Section 4.1 of the ConvNP paper for the derivations / theoretical explanations: https://arxiv.org/pdf/2007.01332.pdf