facebookresearch / Neural_3D_Video

The repository for CVPR 2022 Paper "Neural 3D Video Synthesis"

Quantitative results in Tab. 1 from the main paper #23

Closed tobias-kirschstein closed 1 year ago

tobias-kirschstein commented 1 year ago

Am I right to assume that the numbers in Table 1 from the main paper were obtained by training DyNeRF on more scenes than the 6 available in this GitHub repo? If so, it is basically impossible to compare against the original DyNeRF, as neither the code nor the data from the quantitative comparison are available.

My doubts stem from the fact that DyNeRF shows qualitative results on two additional scenes with multiple moving people. Furthermore, the appendix states what the model was trained on (quoted in the attached screenshot below).

Lastly, the LPIPS scores seem very low to me. How were these scores computed for the paper? If the torchmetrics implementation was used, a common mistake is forgetting normalize=True, which artificially lowers the computed LPIPS scores (see for example: https://github.com/nerfstudio-project/nerfstudio/issues/1424).
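To illustrate the pitfall: a minimal sketch, assuming a recent torchmetrics version (the tensors are random stand-ins for rendered and ground-truth frames):

```python
import torch
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Stand-ins for rendered and ground-truth images in [0, 1], shape (N, 3, H, W)
pred = torch.rand(1, 3, 256, 256)
target = torch.rand(1, 3, 256, 256)

# Correct: normalize=True tells torchmetrics the inputs are in [0, 1]
lpips_ok = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
print(lpips_ok(pred, target))

# Pitfall: the default normalize=False assumes inputs are already in [-1, 1],
# so [0, 1] images are silently treated as low-contrast and the score comes
# out artificially low
lpips_bad = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=False)
print(lpips_bad(pred, target))
```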

It would be great if the authors could clarify these points.

[attached screenshot]

zhaoyang-lv commented 1 year ago

No. The training and evaluation are done on the DyNeRF salmon subset (a 10-second sequence). The table caption says "10-second sequence"; I am not sure where we confused you. The evaluation is done on a held-out center camera view (camera 0).

zhaoyang-lv commented 1 year ago

We did have results on additional sequences in our study. Due to the overhead of open-sourcing those assets, we did not get to the point of releasing them. But we did all the ablations and comparisons on the released subset.

zhaoyang-lv commented 1 year ago

For LPIPS, we implemented our own rather than using the PyTorch implementation. All numbers in the table are calculated using the same evaluation script.

tobias-kirschstein commented 1 year ago

Thanks for coming back to me so quickly!

> The training and evaluation are done on the DyNeRF salmon subset (a 10-second sequence)

Which of the 6 released sequences is the DyNeRF salmon subset? flame_salmon_1 is 40 seconds long.


> But we did all the ablations and comparisons on the released subset

So, the results in Table 1 were done on exactly the 6 sequences released in this repository?


> For LPIPS, we implemented our own rather than using the PyTorch implementation

Would it be possible to share some details of this evaluation script?

zhaoyang-lv commented 1 year ago

> Which of the 6 released sequences is the DyNeRF salmon subset?

The flame_salmon_1 subset is. It should be the first 10 seconds. You can confirm this by comparing against the figure used in the ablation study.
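For reference, a minimal sketch of carving out that snippet, assuming 30 FPS footage (so the first 10 seconds are frames 0-299); the per-camera frame directory layout here is hypothetical:

```python
import shutil
from pathlib import Path

FPS = 30       # assumed frame rate of the released captures
SECONDS = 10   # length of the DyNeRF salmon snippet

src = Path("flame_salmon_1/cam00")      # hypothetical frame directory
dst = Path("flame_salmon_1_10s/cam00")
dst.mkdir(parents=True, exist_ok=True)

# Copy frames 0..299, i.e. the first 10 seconds at 30 FPS
for i in range(FPS * SECONDS):
    name = f"{i:04d}.png"
    shutil.copy(src / name, dst / name)
```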

> So, the results in Table 1 were done on exactly the 6 sequences released in this repository?

To clarify, as stated in the table caption, it was calculated using only the 10-second snippet released in this repository.

> Would it be possible to share some details of this evaluation script?

Unfortunately no. :( Our code is heavily dependent on our internal code repo, which restricts us from sharing it outside.

tobias-kirschstein commented 1 year ago

> The flame_salmon_1 subset is. It should be the first 10 seconds

Got it! Thanks for clarifying this. I think this should be highlighted somewhere. I have seen the numbers from Table 1 quoted in other studies (e.g., NeRFPlayer, HyperReel) side by side with numbers that were computed on all 6 sequences from this repository, which is of course wrong since those numbers are not comparable.

> Unfortunately no. :( Our code is heavily dependent on our internal code repo, which restricts us from sharing it outside.

I understand. For us, it would be enough to know which pretrained image encoder was used (AlexNet or VGG) and whether the images were fed to the encoder as tensors in the range [0, 1] or [-1, 1].

zhaoyang-lv commented 1 year ago

> I think this should be highlighted somewhere.

I will update the README soon to highlight this, hopefully later this week when I have time. Thanks for the suggestion. :)

> which pretrained image encoder was used (AlexNet or VGG)

I can confirm we use AlexNet for this. :)
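For anyone trying to approximate the paper's numbers: a minimal sketch using the public lpips package by Zhang et al. with the AlexNet backbone confirmed above. Note that the [-1, 1] input range is that package's own convention, not a confirmed detail of the internal script:

```python
import torch
import lpips

# AlexNet backbone, as confirmed above
loss_fn = lpips.LPIPS(net="alex")

# The lpips package expects (N, 3, H, W) tensors scaled to [-1, 1];
# random stand-ins for a rendered frame and its ground truth
img0 = torch.rand(1, 3, 256, 256) * 2 - 1
img1 = torch.rand(1, 3, 256, 256) * 2 - 1

print(loss_fn(img0, img1).item())
```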