facebookresearch / StyleNeRF

This is the open-source implementation of the ICLR 2022 paper "StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis".

No 3D-Awareness #19

Open · fentopa opened this issue 2 years ago

fentopa commented 2 years ago

Hi,

thanks for the work! I have trained the model on FFHQ (default settings) and on another dataset that you did not use and which is not public yet; for that dataset I tuned the camera parameters a bit. The image quality is great, but when sampling from different camera poses with render_rotation_camera, the output is basically always the middle (frontal) view, so there is no 3D rotation. In the paper you mention that this can happen when only NeRF path regularization is used. With some training seeds it works a little (still not as well as the images you have shown) and with others not at all, so it is also unstable. Any ideas on how to prevent this, especially when using a new dataset? Do I have to be careful with specific parameters, e.g. any of the camera parameters?
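To be concrete, the check I am doing is essentially the following (a minimal sketch with a hypothetical generator/camera interface, not the exact render_rotation_camera code):

```python
import numpy as np
import torch

# Rotation check: keep the latent fixed, sweep the horizontal pose coordinate,
# and look at whether the rendered head actually rotates.
# `G` and `make_camera` are placeholders for the generator and a camera helper.
def rotation_check(G, make_camera, n_views=8, u_range=0.3):
    z = torch.randn(1, 512)                          # fixed latent code
    frames = []
    for u in np.linspace(-u_range, u_range, n_views):
        cam = make_camera(u=float(u), v=0.5)         # hypothetical camera construction
        frames.append(G(z, camera_matrices=cam))     # in my runs, every view looks frontal
    return frames
```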

MultiPath commented 2 years ago

Could you post the training command in a comment? I can re-run it on my side. I have not found this issue to be serious before.

fentopa commented 2 years ago

Thanks for the quick answer! I used the standard training command (`python run_train.py outdir=${OUTDIR} data=${DATASET} spec=paper512 model=stylenerf_ffhq`) with the unmodified code. On FFHQ it is fine, so there is no need to re-run, but on the other dataset I mainly get flat outputs, from the beginning of training throughout the whole training run. So there is no 3D-awareness. Is there any advice for preventing this on new datasets? Maybe while developing the architecture you noticed that certain things, such as specific parameters, lead to flat outputs?

MultiPath commented 2 years ago

Is that dataset publicly available?

Basically, in this version we may still need to set the camera hyper-parameters manually. In the config you can see range_u and range_v, which define the camera distribution.
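Roughly speaking, range_u and range_v control something like the following (a simplified sketch, not our exact camera code; check the camera module in the repo for the precise mapping):

```python
import math
import torch

# u is drawn inside range_u and mapped to an azimuth (yaw) angle, v inside range_v to a
# polar (pitch) angle. If the ranges are very narrow, the discriminator almost only sees
# near-frontal renders, so the generator has little incentive to learn real 3D structure.
def sample_pose(range_u=(-0.3, 0.3), range_v=(0.4, 0.6)):
    u = torch.empty(1).uniform_(*range_u)
    v = torch.empty(1).uniform_(*range_v)
    yaw = 2 * math.pi * u                # one common parameterization (assumed here)
    pitch = torch.acos(1 - 2 * v)        # maps v in [0, 1] to a polar angle
    return yaw, pitch
```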

fentopa commented 2 years ago

Unfortunately it is not publicly available. The dataset only contains faces, so I thought the FFHQ parameters might work well. I have already increased range_u to -0.6 0.6, which helped a little (going beyond that led to bad results), but the outputs are still much flatter than with FFHQ. I am not sure how to tune the parameters (how should I set u and v?), so it would be great if you could help me improve the 3D-awareness with your model. I should also mention that I use lower-resolution downsampled images (64x64) and only a 64-dimensional latent code and hidden dimension to reduce the model capacity. Could that be a problem as well?

MultiPath commented 2 years ago

Did you also try a uniform distribution instead of a Gaussian for the camera?
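What I mean is something along these lines (a simplified sketch, not the actual sampling code):

```python
import torch

# Both strategies stay inside [lo, hi], but the Gaussian concentrates samples near the
# mean, so the extreme poses are rarely shown to the discriminator during training.
def sample_u(batch_size, range_u=(-0.3, 0.3), gaussian=True):
    lo, hi = range_u
    if gaussian:
        mean, std = (lo + hi) / 2, (hi - lo) / 4              # assumed parameterization
        return torch.randn(batch_size).mul(std).add(mean).clamp(lo, hi)
    return torch.rand(batch_size) * (hi - lo) + lo            # uniform over the full range
```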

fentopa commented 2 years ago

Not yet, I will try it out, thanks! I will report back once I have tried it. So you think it comes down to the camera parameters?

MultiPath commented 2 years ago

Based on my experience, the camera matters a lot. I also have another configuration containing unpublished code which might be helpful. But let's first see whether tuning the camera helps.

fentopa commented 2 years ago

I played around with the camera parameters; the uniform distribution did not help either. My impression is that the 3D-awareness is fine while training at the coarsest scale, but as the progressive training continues to the finer scales, the 3D-awareness gets worse. Did you have a similar experience? Is it because the NeRF path regularization is only applied at the coarsest scale? I am training at 32x32 and then 64x64, so progressively on only two scales; using less than 32x32 did not help. Don't you think it is related to the NeRF path regularization?

Edit: Or is the 2D upsampling the reason? I am leaving resolution_vol at 32, so this might introduce the problem. When training in the first stage at 32x32, the results are good; when continuing at 64x64 the 3D-awareness is gone and I get flat outputs. Directly training at 64x64 also leads to bad results. Does it make sense to use more n_reg_samples or to train longer at 32x32? I am afraid it will get even worse once I go beyond 64x64. It seems like the first stage works fine but the NeRF path regularization does not prevent the flat outputs. I also do not have enough memory to use more than resolution_vol=32.
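For reference, this is how I understand the NeRF path regularization (a conceptual sketch based on the paper, not your actual code); please correct me if my understanding is wrong:

```python
import torch
import torch.nn.functional as F

# The upsampled CNN output is forced to agree with an image rendered purely by the
# NeRF path at a coarse resolution (e.g. 32x32), which is supposed to keep the
# high-resolution output consistent with the underlying 3D representation.
def nerf_path_reg(cnn_output, nerf_render_fn, reg_res=32):
    # cnn_output: [B, 3, H, W] image from the full generator (with 2D upsampling)
    # nerf_render_fn: callable rendering [B, 3, reg_res, reg_res] via the NeRF path only
    nerf_img = nerf_render_fn(reg_res)
    cnn_down = F.adaptive_avg_pool2d(cnn_output, reg_res)   # bring both to the same resolution
    return F.mse_loss(cnn_down, nerf_img)                   # penalize disagreement between paths
```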

KyriaAnnwyn commented 2 years ago

@fentopa how long did it take to train on FFHQ? Could you please share your trained pkl?

I also have the same issue on a different dataset. I guess it could be because FFHQ contains a lot of different head poses, whereas my dataset has mostly frontal images.