Closed · Friedrich-M closed this issue 1 year ago
Thank you for your interest in our work!
The denoising UNet is a simple 2D UNet commonly used in image diffusion models. We are not the first to adapt this kind of network to 3D triplanes (see 3D Neural Field Generation using Triplane Diffusion).
The triplane features are reshaped into a 2D feature map before being fed into the UNet. We let the convolution and attention layers aggregate information across the planes without explicitly designing a 3D architecture, just like the original EG3D, which also generates 3D-aware triplanes with a 2D backbone. However, better network designs have emerged recently (e.g., RODIN Diffusion), which could be plugged into our single-stage framework in the future.
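To make the reshape concrete, here is a minimal sketch of two common ways a triplane latent can be flattened into a 2D feature map for a standard UNet. The shapes and the stacking choices are illustrative assumptions, not necessarily the exact layout used in SSDNeRF:

```python
import torch

# Hypothetical triplane latent: 3 planes, each with C channels at H x W resolution.
N, C, H, W = 2, 8, 32, 32
triplane = torch.randn(N, 3, C, H, W)  # (batch, plane, channel, height, width)

# Option 1 (assumed): stack the planes along the channel axis, so the UNet
# sees a single (3*C)-channel image and its convolutions mix plane features.
unet_input = triplane.reshape(N, 3 * C, H, W)

# Option 2 (assumed): tile the planes side by side spatially, so attention
# layers can attend across planes at matching resolutions.
unet_input_wide = triplane.permute(0, 2, 3, 1, 4).reshape(N, C, H, 3 * W)

print(unet_input.shape)       # torch.Size([2, 24, 32, 32])
print(unet_input_wide.shape)  # torch.Size([2, 8, 32, 96])
```

Either layout keeps the denoiser a purely 2D network; cross-plane 3D consistency then comes from the receptive field of the convolutions and attention rather than from an explicit 3D architecture.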
The NeRF decoder refers to the MLP layers and the volume rendering module. We use torchngp as our renderer, and the MLP details are shown in Fig. 3. In theory, you could use any NeRF architecture.
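As a rough illustration of how a triplane NeRF decoder can be structured, here is a minimal sketch: project each 3D point onto the three planes, bilinearly sample plane features with `grid_sample`, sum them, and run a small MLP to get density and color. The class name, layer sizes, and the sum aggregation are assumptions for illustration, not the exact SSDNeRF implementation (see Fig. 3 of the paper for their MLP details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDecoder(nn.Module):
    """Illustrative triplane -> (density, color) decoder sketch (hypothetical)."""

    def __init__(self, feat_dim=8, hidden=64):
        super().__init__()
        # Small MLP mapping a summed plane feature to 1 density + 3 RGB values.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 4),
        )

    def forward(self, planes, xyz):
        # planes: (3, C, H, W) for the XY, XZ, YZ planes
        # xyz: (P, 3) query points in [-1, 1]^3
        coords = torch.stack([
            xyz[:, [0, 1]],  # project onto XY plane
            xyz[:, [0, 2]],  # project onto XZ plane
            xyz[:, [1, 2]],  # project onto YZ plane
        ])                                   # (3, P, 2)
        grid = coords.unsqueeze(1)           # (3, 1, P, 2), as grid_sample expects
        feats = F.grid_sample(planes, grid, align_corners=True)  # (3, C, 1, P)
        feat = feats.sum(dim=0).squeeze(1).t()                   # (P, C), summed over planes
        out = self.mlp(feat)
        sigma = F.softplus(out[:, :1])       # non-negative density
        rgb = torch.sigmoid(out[:, 1:])      # color in [0, 1]
        return sigma, rgb

decoder = TriplaneDecoder()
planes = torch.randn(3, 8, 16, 16)
xyz = torch.rand(7, 3) * 2 - 1
sigma, rgb = decoder(planes, xyz)
print(sigma.shape, rgb.shape)  # torch.Size([7, 1]) torch.Size([7, 3])
```

A volume renderer such as torchngp would then composite these per-point densities and colors along camera rays; the decoder above only covers the per-point MLP part.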
Your answers are really helpful. Thank you for your expertise and time!
I have read your paper on SSDNeRF and found it very interesting. I have a few questions about the implementation details that I could not find in the main text or supplementary materials. I would greatly appreciate it if you could provide some clarification.
Could you please explain how the denoising UNet is designed in your approach? Is it a traditional 2D UNet architecture?
How do you aggregate information from the three planes of the triplane in your approach, while ensuring that the network is 3D-aware?
Could you provide more information on how the NeRF decoder is designed and implemented? If possible, sharing some relevant code snippets would be very helpful.
Thank you for your time and effort in addressing these questions. I am looking forward to learning more about your work and its underlying techniques.