fraunhoferhhi / gaussian_gan_decoder

Official implementation for: https://arxiv.org/abs/2404.10625
https://florian-barthel.github.io/gaussian_decoder/index.html

About training and optimization #9

Closed yangqinhui0423 closed 3 days ago

yangqinhui0423 commented 1 month ago

Dear author, I have some questions about the training part of the article. During the process of training the decoder, are the decoder weights optimized at the same time as the Gaussian parameters? Or do you first optimize the Gaussian parameters and then optimize the decoder weights using the parameters produced by the decoder?

yangqinhui0423 commented 1 month ago

> Dear author, I have some questions about the training part of the article. During the process of training the decoder, are the decoder weights optimized at the same time as the Gaussian parameters? Or do you first optimize the Gaussian parameters and then optimize the decoder weights using the parameters produced by the decoder?

Maybe you first get the optimized Gaussian parameters and then use those parameters to optimize the decoder? The goal is just to train the decoder. I see the passage below. [image]

Florian-Barthel commented 1 month ago

Hey, I hope I understand your question correctly. We do not optimize the Gaussian splatting parameters. Instead, the decoder learns to directly predict the Gaussian splatting parameters from the information in the 3D GAN tri-plane.

This means that we do not have any ground-truth parameters during training. Nevertheless, a loss can still be computed by comparing the rendering of the NeRF-based 3D GAN with the rendering of the decoded Gaussian splatting scene. The Gaussian splatting parameters themselves, however, are never directly optimized.
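The training signal described above can be sketched as follows. This is a minimal toy, not the repository's actual code: the decoder is stood in by a single linear layer, both renderers are trivial stand-ins, and all names and sizes are illustrative. The point is that the loss compares two renderings, the NeRF target is detached, and gradients flow only into the decoder weights.

```python
import torch

feat_dim, n_gauss, param_dim = 32, 64, 14            # illustrative sizes
decoder = torch.nn.Linear(feat_dim, n_gauss * param_dim)
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def nerf_render(features):
    # Stand-in for the frozen NeRF-based 3D GAN renderer.
    return features.mean() * torch.ones(3, 8, 8)

def splat_render(params):
    # Stand-in for the (differentiable) Gaussian splatting renderer.
    return params.mean() * torch.ones(3, 8, 8)

features = torch.randn(feat_dim)                     # pretend tri-plane features
with torch.no_grad():
    target = nerf_render(features)                   # target image, no gradients

params = decoder(features)                           # predicted Gaussian parameters
pred = splat_render(params)                          # rendered from the prediction
loss = torch.nn.functional.l1_loss(pred, target)     # compare the two renderings

optimizer.zero_grad()
loss.backward()
optimizer.step()                                     # only decoder weights update
```

Because the Gaussian parameters are an intermediate activation rather than leaf variables with their own optimizer, they are never optimized directly, matching the description above.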

yangqinhui0423 commented 1 month ago

> Hey, I hope I understand your question correctly. We do not optimize the Gaussian splatting parameters. Instead, the decoder learns to directly predict the Gaussian splatting parameters from the information in the 3D GAN tri-plane.
>
> This means that we do not have any ground-truth parameters during training. Nevertheless, a loss can still be computed by comparing the rendering of the NeRF-based 3D GAN with the rendering of the decoded Gaussian splatting scene. The Gaussian splatting parameters themselves, however, are never directly optimized.

Amazing! Thanks for your patient answer. I think I understand the process now: during training, the decoder weights are optimized through the loss function, so that the decoder gradually predicts more accurate Gaussian parameters. During inference, after obtaining accurate Gaussian parameters from the pre-trained decoder, we can render directly without optimizing the Gaussian parameters. By the way, when training, do you use 360° multi-view images for one face model?

Florian-Barthel commented 1 month ago

Yes, that's correct. During inference, no optimization is used at all.

Regarding "when training, do you use 360° multi-view images for one face model?": during training, we do not show the same face more than once. We tested showing the same ID 4 or 8 times before moving on to the next ID; however, the performance did not improve.

yangqinhui0423 commented 1 month ago

> Yes, that's correct. During inference, no optimization is used at all.
>
> Regarding "when training, do you use 360° multi-view images for one face model?": during training, we do not show the same face more than once. We tested showing the same ID 4 or 8 times before moving on to the next ID; however, the performance did not improve.

Thank you. So for each face you only use one picture. I also wonder how you obtain the SH coefficients, since I see you omit the estimation of view-dependent SH coefficients when training the decoder.

Florian-Barthel commented 1 month ago

The SH is disabled in our training; we simply set it to 0 across all dimensions. Enabling it could be interesting for future work, as SH might improve the quality of the eyes, which move depending on the camera position when rendering with the NeRF renderer.
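A small NumPy illustration of what "set it to 0 across all dimensions" amounts to in the usual 3DGS color storage (sizes are made up; the constant is the standard degree-0 SH basis factor used in common 3DGS implementations):

```python
import numpy as np

# In 3DGS, each Gaussian's color is stored as SH coefficients per RGB channel.
# Zeroing every view-dependent coefficient and keeping only the degree-0 (DC)
# term yields a color that is identical from every viewing direction.
n_gauss, sh_degree = 4, 3                 # illustrative sizes
n_coeffs = (sh_degree + 1) ** 2           # 16 coefficients per color channel
sh = np.zeros((n_gauss, n_coeffs, 3))     # all view-dependent terms stay 0
sh[:, 0, :] = 0.5                         # only the DC term carries color

# With the higher orders at zero, the evaluated color reduces to the DC term:
C0 = 0.28209479177387814                  # degree-0 SH basis constant
color = C0 * sh[:, 0, :] + 0.5            # common 3DGS DC-to-RGB conversion
```

Since every other coefficient is zero, `color` is the full view-independent result; no camera direction enters the evaluation.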