Yes, we have tested NSFF, but in the end we didn't include it, mainly because NSFF does not aim for a clean decomposition of the scene but rather focuses on overall reconstruction quality. For videos with large camera motion, it generally cannot achieve a good decomposition. Besides, NSFF is not self-supervised but requires ~~masks of the dynamic objects in the scene~~ depth and optical flow to work, making the comparison unfair.
Here are the results of a quick run of NSFF (left dynamic, right static). Training was done with fewer iterations, so many fine details are missing, but the decomposition already fails, probably because of the inaccurate dynamic masks obtained from the pre-trained segmentation method used by NSFF.
Hope that helps!
Interesting! Thanks for your quick response and thoughtful explanations.
(1) The results look a bit weird to me since one would expect the 2D flow supervision to help with the decomposition of foreground and background.
(2) I am slightly confused about what you meant by "(NSFF) requires masks of the dynamic objects in the scene to work". Do you mean that it requires depth and flow estimation from off-the-shelf models, or that it requires a foreground mask for COLMAP SfM? If you mean the latter, I guess that is a quite general issue that actually applies to every dynamic NeRF currently.
Thanks again!
After more careful thought, I think the fact that D2NeRF works better than NSFF without requiring flow supervision is promising. It would be very helpful to show this kind of comparison in a future revision; it both shows the problem in the previous work and convinces readers that your model is stronger -- I am sure the community will find it helpful for future research!
> (1) The results look a bit weird to me since one would expect the 2D flow supervision to help with the decomposition of foreground and background.
The main issue with NSFF's decomposition is the "blending-style merging" they have -- they predict a blending weight from the dynamic component and merge both components based on this weight. This usually doesn't allow a clean separation. This issue posted on the NSFF repo gives a wonderful discussion of this.
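For reference, here is a minimal PyTorch sketch of roughly what that blending-style composition looks like (my own simplification for illustration, not the actual NSFF code; all names are made up). Because only the blended mixture is supervised, each branch on its own can contain arbitrary content:

```python
import torch

def blend_composite(rgb_s, sigma_s, rgb_d, sigma_d, blend, deltas):
    # Sketch of blending-style composition: a weight predicted by the dynamic
    # branch linearly mixes static and dynamic color/density at every sample,
    # so only the weighted mixture needs to be correct -- which hurts clean
    # separation of the two components.
    # rgb_*: [N_rays, N_samples, 3]; sigma_*, blend, deltas: [N_rays, N_samples]
    sigma = blend * sigma_d + (1.0 - blend) * sigma_s
    rgb = blend[..., None] * rgb_d + (1.0 - blend[..., None]) * rgb_s

    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = alpha * trans                       # standard volume-rendering weights
    return (weights[..., None] * rgb).sum(dim=1)  # composited ray color [N_rays, 3]
```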
> (2) I am slightly confused about what you meant by "(NSFF) requires masks of the dynamic objects in the scene to work". Do you mean that it requires depth and flow estimation from off-the-shelf models, or that it requires a foreground mask for COLMAP SfM? If you mean the latter, I guess that is a quite general issue that actually applies to every dynamic NeRF currently.
Sorry, my bad; I mixed NSFF up with some other methods that require explicit mask supervision. Depth and flow supervision are what I meant to mention.
> After more careful thought, I think the fact that D2NeRF works better than NSFF without requiring flow supervision is promising. It would be very helpful to show this kind of comparison in a future revision; it both shows the problem in the previous work and convinces readers that your model is stronger -- I am sure the community will find it helpful for future research!
Thank you very much for your suggestions! We totally agree and would definitely try to include additional comparisons with NSFF in the future revision.
Closing this now, thanks again for releasing your code and paper at the same time -- it is very helpful for playing around with it. Good luck!
I have tried NSFF on the `vrig-peel-banana` sequence, and this is what I got (after a day or so):
Composed | Foreground | Background | Foreground mask
Not perfect, but a bit better than what you showed. I do find the hyperparameters sensitive, and I used the ones from HyperNeRF (which are actually now the default params in zhengqi's repo). I also had to modify the code to support HyperNeRF's cameras (per-camera focal and principal point etc.).
The blending mechanism is indeed problematic for good decomposition, as the foreground actually captured a lot of the background scene. @d2nerf I am wondering, when you show the foreground rendering, do you apply the foreground mask? I would assume using the updated version, which uses additive blending, would give you better results, although I have not tested it yet. Just posting here for future reference!
Hi Hang, thanks a lot for posting the results; they indeed look a lot better! May I ask if it's possible to share your modified code with us so we can find out what we might have done wrong? We have actually tried both adapting HyperNeRF's cameras and using the scripts provided by NSFF to extract parameters from the COLMAP output to train the scenes, but in both cases the decomposition appears quite bad. As for the hyperparameters, we are also using the new configurations fine-tuned for scenes with a large number of frames.
> I am wondering, when you show the foreground rendering, do you apply the foreground mask?
No, for all renderings of dynamic objects in the paper & website, we simply query and render the dynamic component, without applying any mask at all. I believe this issue with NSFF is indeed caused by the merging-style blending, and using an additive style should be able to resolve it (and probably improve reconstruction quality as well, as the dynamic component no longer needs to learn background details that are never actually rendered; not tested though).
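(For completeness, roughly what "just query and render the dynamic component" means, assuming a standard volume-rendering setup; this is a sketch for illustration, not the actual D2NeRF code, and the names are made up:)

```python
import torch

def render_dynamic_only(rgb_d, sigma_d, deltas):
    # Sketch: run plain volume rendering with only the dynamic branch's density
    # and color -- no mask applied, no static branch involved at all.
    alpha = 1.0 - torch.exp(-sigma_d * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb_d).sum(dim=1)  # dynamic-only ray color [N_rays, 3]
```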
Here you are! Though I have not cleaned it up or anything. You should apply it to the original repo.
git clone https://github.com/zhengqili/Neural-Scene-Flow-Fields nsff
cd nsff
wget https://github.com/d2nerf/d2nerf/files/9057882/nsff.patch.zip && unzip nsff.patch.zip
git apply --reject --whitespace=fix nsff.patch
I am off to try out nsff_pl. We had some experiments with it on our own sequences; it indeed performs reasonably well compared to the original repo, with a single set of hyperparams across all sequences, which is quite nice. Will see how it works on the banana sequence.
Hi Hang, thank you very much for sharing the code! However, it seems that the git patch you shared doesn't contain any actual changes to the NSFF code? (I could only see the linting changes; data loading is still done by `_load_data()` and no per-camera focal/principal point is enabled. PS. I don't think that has led to the differences in our results, as in our experiments we only used the right camera to train the model, so the focal and principal point are consistent at least for the training views, so I'm wondering if there is anything else you have changed in particular?)
After a closer look at the results, I think it might be related to the initial batch samples during training, as well as the specific training view being rendered. The images below are the static background at 10k and 70k iterations respectively:
So I'm wondering, do the results you got all look as good as the ones you posted before?
Hi, please search for something like `cxcy` in the patch. That's the main modification I made; I have not touched the model itself. Training with the original codebase is definitely sensitive to initialization. I tried training a few times and the results vary, as you pointed out.
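For anyone else adapting the code, the change is roughly along these lines (a hypothetical sketch of per-camera principal-point support in ray generation, not the exact patch; `get_rays`, `cx`, `cy` are placeholder names):

```python
import numpy as np

def get_rays(H, W, focal, c2w, cx=None, cy=None):
    # Sketch: generate rays using a per-camera principal point (cx, cy) instead
    # of assuming the image center, as HyperNeRF-style cameras require.
    if cx is None:
        cx = 0.5 * W
    if cy is None:
        cy = 0.5 * H
    i, j = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32), indexing='xy')
    dirs = np.stack([(i - cx) / focal, -(j - cy) / focal, -np.ones_like(i)], axis=-1)
    rays_d = np.sum(dirs[..., None, :] * c2w[:3, :3], axis=-1)  # rotate to world frame
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)
    return rays_o, rays_d
```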
I mentioned this repo (nsff_pl). It made a few modifications and gets rid of that problem. I've found it quite stable and it can separate things well. Here is something I got.
Notably, the main modification of this repo is to change the original composition from blending to NeRF-W/D2-NeRF style summation. I am not sure if it can be called NSFF anymore (maybe some new variant), but as you can see the results are quite good and it is not as sensitive as D2-NeRF (the default hyperparams work for all sequences we tested).
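To make the contrast with the blending style concrete, here is a rough sketch of that additive summation (again a simplification for illustration, not the exact nsff_pl code): densities are summed and colors are mixed by each branch's density share, so a 3D point is explained by whichever branch actually places density there.

```python
import torch

def additive_composite(rgb_s, sigma_s, rgb_d, sigma_d, deltas, eps=1e-10):
    # Sketch of NeRF-W / D2-NeRF style additive composition: sum the densities,
    # mix colors by each branch's density share, then volume-render as usual.
    sigma = sigma_s + sigma_d
    rgb = (sigma_s[..., None] * rgb_s + sigma_d[..., None] * rgb_d) / (sigma[..., None] + eps)

    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + eps], dim=-1), dim=-1
    )[:, :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)  # composited ray color [N_rays, 3]
```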
Ah, thank you for sharing the results; this indeed looks a lot better. Yeah, I suppose the main issue with the original NSFF is that the averaging style allows for weird radiance in each component, as long as their weighted average ends up with the expected color. I'll try to run a few more experiments with this improved version of NSFF to see if it works well on our dataset.
A quick update: it seems that the improved version of NSFF still doesn't quite work with dynamic shadows. This is probably expected as it is almost impossible to obtain the correct optical flow for shadows. Here are the results from nsff_pl:
Hi dear authors,
Thanks again for sharing this work. I am wondering, have you guys tried comparing against NSFF? It also decomposes the scene into components and seems to work reasonably well. It would be a very strong baseline!
Thanks, Hang