andrewhou1 / GeomConsistentFR

Official Code for Face Relighting with Geometrically Consistent Shadows (CVPR 2022)
https://openaccess.thecvf.com/content/CVPR2022/html/Hou_Face_Relighting_With_Geometrically_Consistent_Shadows_CVPR_2022_paper.html
MIT License

How to generate depth maps #6

Closed · rshiv2 closed this issue 1 year ago

rshiv2 commented 1 year ago

How did you generate depth maps for the CelebA-HQ dataset? In your paper, you mention a Deep Multi-View Stereo approach, but the model accompanying this approach seems to require at least two views of the same face in order to construct a 3D Morphable Model.

rshiv2 commented 1 year ago

I've tried running Bai et al. using two copies of the same image, and the result has a lot of artifacts. However, if I use two different views of the same subject, I get pretty good 3D morphable models.

andrewhou1 commented 1 year ago

That model can also produce a 3D face given a single image (in their code you can specify the same image twice in the image list, as you mentioned). The quality of the 3D face will of course improve if you have two or more images from different views constraining the shape instead of just a single view, but to use CelebA-HQ for training we only have a single view available. The reason we use this method for our depth map estimation is that we also apply it to estimate the groundtruth 3D shape for our surface normal evaluations on Multi-PIE, using 3 views at a time, so we wanted to keep the method consistent.

rshiv2 commented 1 year ago

Ah, I see. It's strange, though: when I try to run Bai et al. on two copies of a CelebA-HQ image, I get:

[attached image: vis_view0]

...agh, that's terrifying. Maybe I set up the Deep Multi-View Stereo code incorrectly?

It was honestly kind of a pain to get that codebase set up in the first place - do you think it would be fairly easy to use depth estimates from a different network to supervise the depth decoder? I'm just worried that, in doing so, I'll break the raytracing part of the code. The various depth estimators I've looked at use different units for measuring depth (and, occasionally, different coordinate spaces), so if I swap in a new depth teacher, I'm worried the raytracing portion of the network will produce wildly unrealistic shadows. Have you experimented with this? Thanks!
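For concreteness, the crude workaround I had in mind for the unit mismatch is to affinely rescale a new teacher's depth maps to the range of the existing ones before using them as supervision. A minimal sketch, with hypothetical names and the assumption that 0 marks background pixels; it says nothing about how the raytracing code itself expects depth to be parameterized:

```python
import numpy as np

def match_depth_range(new_depth, reference_depth, mask=None):
    """Affinely rescale new_depth so its valid values span the same range as
    reference_depth. A crude way to reconcile depth teachers that use different
    units; it is not a substitute for a proper coordinate-space alignment."""
    if mask is None:
        mask = new_depth > 0  # assumption: 0 marks background / invalid pixels
    ref_mask = reference_depth > 0
    new_min, new_max = new_depth[mask].min(), new_depth[mask].max()
    ref_min, ref_max = reference_depth[ref_mask].min(), reference_depth[ref_mask].max()
    # Map the new teacher's valid depths onto [ref_min, ref_max].
    scaled = (new_depth - new_min) / max(new_max - new_min, 1e-8)
    return np.where(mask, ref_min + scaled * (ref_max - ref_min), 0.0)
```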

andrewhou1 commented 1 year ago

[attached image: 0_vis_view0]

So this is what their model produces on my end; maybe it is indeed a difference in how we set up their codebase. Are you interested in retraining the model? If so, I've already provided the depth maps for CelebA-HQ in the Google Drive link that contains the training data for this repo (see the README). What's missing is the original CelebA-HQ images, which of course must be downloaded on your own and cropped.

rshiv2 commented 1 year ago

Yes, I am trying to retrain the model, but not with exactly the same ground truth that you provide. For context, I tried running a pre-trained network on some images with low light / exposure. It turns out that when you feed such images into the model, you get really poor depth estimates with ridges and streaks all over the place, which in turn leads to ridge- and streak-like shadows. I had two ideas for getting around this:

1) Create an augmented, low-light version of the CelebA-HQ dataset and train the model to handle low-light inputs in the first place (a rough sketch of what I mean is below).
2) Supervise the depth estimates with higher-quality ground truth. I took a look at the ground truth depth maps that you provided in the Google Drive and noticed that they were coarser than the ones presented in Bai et al., so I thought I'd try using Bai et al. to generate a new, finer-grained set of depth maps.

I'm trying both of these in parallel, but ran into some issues with approach 2), which is why I opened this issue.
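For option 1, a minimal sketch of the kind of low-light augmentation I have in mind: a simple exposure scale plus sensor noise applied in an approximately linear space. The function name, the gamma model, and the parameter values are just illustrative assumptions, not anything from this repo:

```python
import numpy as np

def simulate_low_light(img, exposure=0.25, gamma=2.2, read_noise_std=0.01, seed=None):
    """Darken a float RGB image in [0, 1] to mimic a low-light / underexposed capture.

    img: (H, W, 3) float array in [0, 1].
    exposure: linear brightness scale (< 1 darkens).
    gamma: assumed display gamma for the sRGB <-> linear round trip.
    read_noise_std: std of Gaussian sensor noise added in linear space.
    """
    rng = np.random.default_rng(seed)
    linear = np.power(np.clip(img, 0.0, 1.0), gamma)               # approx. sRGB -> linear
    linear = linear * exposure                                      # reduce exposure
    linear = linear + rng.normal(0.0, read_noise_std, img.shape)    # sensor noise
    linear = np.clip(linear, 0.0, 1.0)
    return np.power(linear, 1.0 / gamma)                            # back to display space

# The groundtruth depth map for each image stays unchanged; only the input is darkened, e.g.:
# dark = simulate_low_light(celeba_img, exposure=0.2)
```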

andrewhou1 commented 1 year ago

Got it, that makes sense. So I generate the depth maps from the predicted 3D face by using z-buffering to compute the minimum depth at each pixel. The depth maps might seem coarse because of the image resolution our model is trained on (256x256), which means the depth maps are the same resolution. If it's difficult to get results from Bai et al. similar to what I showed in my previous comment, option 1 is doable: you could use the same groundtruth depth maps I provided in the drive link while synthesizing each CelebA-HQ image under low light. That way the network could learn to handle the low-light images, as you mentioned, and still estimate appropriate depth maps.
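For reference, a minimal sketch of the z-buffering step described above: rasterize the projected 3D face mesh and keep the minimum depth per pixel at 256x256. The function name and the assumption that the vertices are already projected into image coordinates are illustrative, not the exact code used in this repo:

```python
import numpy as np

def render_depth(verts, faces, H=256, W=256, background=0.0):
    """Rasterize a triangle mesh into a per-pixel minimum-depth map (z-buffering).

    verts: (N, 3) vertices already projected into image space, with x in [0, W),
           y in [0, H), and z = depth along the camera axis.
    faces: (M, 3) integer array of vertex indices per triangle.
    """
    depth = np.full((H, W), np.inf)

    for tri in faces:
        v0, v1, v2 = verts[tri]
        # Bounding box of the triangle, clipped to the image.
        xmin = max(int(np.floor(min(v0[0], v1[0], v2[0]))), 0)
        xmax = min(int(np.ceil(max(v0[0], v1[0], v2[0]))), W - 1)
        ymin = max(int(np.floor(min(v0[1], v1[1], v2[1]))), 0)
        ymax = min(int(np.ceil(max(v0[1], v1[1], v2[1]))), H - 1)
        if xmin > xmax or ymin > ymax:
            continue

        # Barycentric coordinates for every pixel in the bounding box.
        xs, ys = np.meshgrid(np.arange(xmin, xmax + 1), np.arange(ymin, ymax + 1))
        d = (v1[1] - v2[1]) * (v0[0] - v2[0]) + (v2[0] - v1[0]) * (v0[1] - v2[1])
        if abs(d) < 1e-12:
            continue  # degenerate triangle
        w0 = ((v1[1] - v2[1]) * (xs - v2[0]) + (v2[0] - v1[0]) * (ys - v2[1])) / d
        w1 = ((v2[1] - v0[1]) * (xs - v2[0]) + (v0[0] - v2[0]) * (ys - v2[1])) / d
        w2 = 1.0 - w0 - w1
        inside = (w0 >= 0) & (w1 >= 0) & (w2 >= 0)

        # Interpolate depth and keep the minimum (closest surface) per pixel.
        z = w0 * v0[2] + w1 * v1[2] + w2 * v2[2]
        patch = depth[ymin:ymax + 1, xmin:xmax + 1]
        depth[ymin:ymax + 1, xmin:xmax + 1] = np.where(inside & (z < patch), z, patch)

    depth[np.isinf(depth)] = background  # pixels the face never covers
    return depth
```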