ayaanzhaque / instruct-nerf2nerf

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions (ICCV 2023)
https://instruct-nerf2nerf.github.io/
MIT License

SDS Baseline #43

Closed jaidevshriram closed 1 year ago

jaidevshriram commented 1 year ago

This is a really fun paper! Thanks for making the code public. I was hoping to get some more information about the SDS baseline in the paper, which I'm currently trying to replicate: specifically, the size of the training images, the guidance weights, and the batch size. The results look quite good IMO and far from the saturated colours that SDS in DreamFusion produces, so that's quite interesting to see. Would love your input, thanks!

ayaanzhaque commented 1 year ago

The implementation code for SDS + IP2P is pretty messy at the moment, so it'll take me a bit of time to clean it up. In the meantime, if you would like to write it on your own, take a look at this code: https://github.com/nerfstudio-project/nerfstudio/blob/generative/nerfstudio/generative/stable_diffusion.py

This is the SDS loss implementation in Nerfstudio, which I based my code off of. Basically, you will have to swap out Stable Diffusion for InstructPix2Pix. You can use the code in the instruct-nerf2nerf repo to see how to load and run InstructPix2Pix. Then, once you calculate the loss, just pass it through the pipeline so it gets backpropagated into the NeRF. Hopefully this helps; otherwise, I'll take a day or two to get the code cleaned up for you. It will take some time because it requires refactoring the dataloader, pipeline, and model classes.
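For what it's worth, here is a minimal sketch (not the paper's actual code) of what that swap might look like: it follows the structure of the linked Nerfstudio SDS loss but calls the diffusers InstructPix2Pix pipeline instead of Stable Diffusion. The names `render`, `cond_image`, and `text_embeddings` are placeholders, and for brevity it only applies text classifier-free guidance rather than InstructPix2Pix's full three-way (text + image) guidance:

```python
# Hedged sketch only: an SDS-style loss with InstructPix2Pix in place of
# Stable Diffusion. The exact conditioning and weighting used for the
# paper's baseline may differ.
import torch
from diffusers import DDIMScheduler, StableDiffusionInstructPix2PixPipeline

device = "cuda"
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float32
).to(device)
vae, unet = pipe.vae, pipe.unet
scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
alphas_cumprod = scheduler.alphas_cumprod.to(device)


def sds_loss_ip2p(render, cond_image, text_embeddings, guidance_scale=7.5):
    """render: NeRF render resized to 512x512 in [0, 1], shape (1, 3, 512, 512).
    cond_image: the unedited training image (IP2P's image conditioning).
    text_embeddings: stacked [uncond, cond] instruction embeddings."""
    # Encode the render (with gradients) and the conditioning image (without).
    latents = vae.encode(render * 2 - 1).latent_dist.sample() * vae.config.scaling_factor
    with torch.no_grad():
        image_latents = vae.encode(cond_image * 2 - 1).latent_dist.mode()

    # Sample a timestep and perturb the render's latents with noise.
    t = torch.randint(20, 980, (1,), device=device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        # IP2P's UNet takes the noisy latents concatenated channel-wise with the
        # image-condition latents. Only text CFG is shown here; the real model
        # also uses an image guidance scale with a three-way CFG split.
        latent_in = torch.cat([noisy_latents] * 2)
        image_in = torch.cat([torch.zeros_like(image_latents), image_latents])
        unet_in = torch.cat([latent_in, image_in], dim=1)
        noise_pred = unet(unet_in, t, encoder_hidden_states=text_embeddings).sample
        noise_uncond, noise_cond = noise_pred.chunk(2)
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

    # Standard SDS gradient w(t) * (noise_pred - noise), injected with the
    # usual detach trick so it backpropagates through the VAE into the NeRF.
    w = 1 - alphas_cumprod[t]
    grad = (w * (noise_pred - noise)).detach()
    return (grad * latents).sum()
```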

In terms of hyperparameters, the guidance scales are the same as what we use for the Iterative Dataset Update, so they do have to be tuned a bit to get the edit you want. The reason you don't see the high saturation is likely because the guidance scales are low. We used a batch size of 1, and each render was around 10-16k rays (pixels); the exact resolution varied based on the resolution of the input images, but I recall it was something around 128x80. Either way the resolution was quite low, since carrying gradients for many more pixels was too expensive. We then resize to 512x512 before inputting into the Stable Diffusion autoencoder. In our implementation, we compute the loss in image space, where we basically generate the edited image by passing it through InstructPix2Pix, and use w * (render - edited_render) as our gradient. The rest of the implementation should be the same as the code link I shared. Good luck! P.S. The generative branch in Nerfstudio should be merged very very soon, so that could give you a better starting point too.
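To make that last point concrete, here is a tiny sketch of the image-space loss (placeholder names, not the repo's code): the low-resolution render is upsampled to 512x512, edited with InstructPix2Pix under no_grad, and the squared error against the detached edited image yields exactly w * (render - edited_render) as the gradient with respect to the render (through the upsampling). `edit_with_ip2p` is a hypothetical wrapper around the ~20-step edit in ip2p.py:

```python
# Minimal sketch of the image-space variant described above; `edit_with_ip2p`
# is a hypothetical helper that wraps the partial-diffusion edit from this
# repo's ip2p.py, and `w` is the loss weight.
import torch
import torch.nn.functional as F


def image_space_sds_loss(render, train_image, edit_with_ip2p, w=1.0):
    # Upsample the low-resolution NeRF render (e.g. ~128x80) to 512x512
    # before it goes through the Stable Diffusion autoencoder inside IP2P.
    render_512 = F.interpolate(render, size=(512, 512), mode="bilinear", align_corners=False)

    # Produce the edited render with InstructPix2Pix; no gradients flow
    # through the diffusion model itself.
    with torch.no_grad():
        edited_render = edit_with_ip2p(render_512, cond_image=train_image)

    # 0.5 * w * ||render - edited_render||^2 has gradient w * (render - edited_render)
    # with respect to the render, matching the update described above.
    return 0.5 * w * (render_512 - edited_render.detach()).pow(2).sum()
```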

jaidevshriram commented 1 year ago

Thank you!! I think I have enough information to reproduce it now. I have a question regarding the updated loss function:

In our implementation, we compute the loss in image space, where we basically generate the edited image by passing it through InstructPix2Pix, and use w * (render - edited_render) as our gradient.

Do you mean that instead of doing one denoising step with the UNet, you complete the whole diffusion trajectory and compute the diff between the edited image and the render? If so, is that equivalent to the L2 difference between the render and the edited image?

The generative branch in Nerfstudio should be merged very very soon, so that could give you a better starting point too

Nice to hear! Looking forward to it!

ayaanzhaque commented 1 year ago

It's not the full reverse diffusion process: if you take a look at the code in this repo for diffusing an image, we take 20 denoising steps. You can use the loop we have in the ip2p.py file in this repo, but you will have to make a few changes to make it work for Stable Diffusion. Hopefully this helps. I'll close this issue for now, but let me know if you need any other help!
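For reference, a rough sketch of that kind of partial denoising loop (not the exact ip2p.py code): the current latents are noised to an intermediate timestep and then denoised for ~20 steps rather than running the full reverse trajectory. `unet_cfg_step` is a stand-in for a classifier-free-guided UNet call (text plus image conditioning in the IP2P case):

```python
# Hedged sketch of a partial denoising loop with a diffusers DDIM scheduler.
import torch
from diffusers import DDIMScheduler


def partial_denoise(latents, unet_cfg_step, scheduler: DDIMScheduler,
                    num_steps=20, strength=0.8):
    scheduler.set_timesteps(num_steps)
    # Keep only the tail of the schedule: start part-way in, not from pure noise.
    t_start_index = int(num_steps * (1 - strength))
    timesteps = scheduler.timesteps[t_start_index:]

    # Perturb the input latents up to the first retained timestep.
    noise = torch.randn_like(latents)
    latents = scheduler.add_noise(latents, noise, timesteps[:1])

    for t in timesteps:
        noise_pred = unet_cfg_step(latents, t)  # guided epsilon prediction
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```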