VITA-Group / GNT

[ICLR 2023] "Is Attention All NeRF Needs?" by Mukund Varma T*, Peihao Wang* , Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang
https://vita-group.github.io/GNT
MIT License

Training strategy for the released model #14

Closed zhuhd15 closed 1 year ago

zhuhd15 commented 1 year ago

Hi, thanks for the fantastic work!

I've been attempting to replicate the results using the training configurations provided in the repository. However, it appears that the iteration counts of the pretrained models don't quite align with the configs. In your paper, you mention training GNT with N_rand set to 4096 for 250k iterations across all experiments, while the released models appear to have been trained for considerably longer, judging by their filenames (for instance, the fern model was trained for 840k iterations and the generalization model for 720k).

When I trained the models following your configs, I noticed a significant discrepancy relative to the released models. I was wondering if you could update the configurations or training strategy so that we can accurately reproduce the numbers of the models you released. Thank you so much!

MukundVarmaT commented 1 year ago

Hi @zhuhd15,

Although the released models were trained for longer, I have observed that the metric values stop changing after 250-300k iterations. We simply released the later checkpoints because those were the ones I had on hand. Please let me know if you have any trouble reproducing these numbers.

zhuhd15 commented 1 year ago

Thank you so much for your prompt reply!

We downloaded the released models and tested three cases: 1) single scene on drums, 2) the generalizable setting on LLFF, and 3) the generalizable setting on the Synthetic dataset. Using the scripts provided in the repo, we obtained the following numbers (drums as an example):

| Drums (single scene) | PSNR | LPIPS | SSIM |
| --- | --- | --- | --- |
| GNT* (reported in paper) | 28.32 | 0.030 | 0.966 |
| GNT* (model released by GNT, 500k) | 27.85 | 0.0336 | 0.9630 |
| GNT (reproduced, 250k, N_rand 1024) | 26.84 | 0.0516 | 0.9532 |
| GNT (reproduced, 250k, N_rand 4096) | 27.60 | 0.0444 | 0.9592 |

It seems the retrained model with N_rand set to 4096, following your paper, has noticeably worse LPIPS and SSIM scores, and the relative gaps are not small, especially for LPIPS (~25% relative difference). Is there anything we can do to reproduce the numbers of your released model?

Thanks!

MukundVarmaT commented 1 year ago

We have observed that although GNT renders quite reasonably in most cases, regions with a plain background tend to come out a shade darker than the ground truth (an inherent drawback of using attention). An example is the white background in the drums scene. To verify, please try identifying the background (either from the ground-truth mask or with any other segmentation method), force-setting it to white, and then recomputing the above metrics.
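A minimal sketch of that check (the function names, the use of float images in [0, 1], and the boolean background mask are my assumptions, not part of the repo's eval scripts; LPIPS/SSIM can be recomputed on the same whitened image analogously):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio

def whiten_background(img, bg_mask):
    """Force-set background pixels (bg_mask == True) to pure white.

    img: float array in [0, 1], shape (H, W, 3); bg_mask: bool, shape (H, W).
    """
    out = img.copy()
    out[bg_mask] = 1.0
    return out

def masked_psnr(pred, gt, bg_mask):
    """PSNR after replacing the (darkened) rendered background with white."""
    return peak_signal_noise_ratio(gt, whiten_background(pred, bg_mask),
                                   data_range=1.0)

if __name__ == "__main__":
    # Toy example: white-background GT with a gray object; the rendering
    # reproduces the object well but darkens the background slightly.
    gt = np.ones((8, 8, 3), dtype=np.float64)
    gt[2:6, 2:6] = 0.5
    pred = np.full((8, 8, 3), 0.9)
    pred[2:6, 2:6] = 0.48
    bg_mask = np.ones((8, 8), dtype=bool)
    bg_mask[2:6, 2:6] = False

    raw = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    fixed = masked_psnr(pred, gt, bg_mask)
    print(f"raw PSNR: {raw:.2f} dB, background-whitened PSNR: {fixed:.2f} dB")
```

If the whitened metrics close most of the gap, the darkened background (rather than the object rendering) accounts for the discrepancy.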

zhuhd15 commented 1 year ago

Thanks for your response!