LinjieLyu / NRTF


Confusion about envmaps #5

Closed EliaFantini closed 1 year ago

EliaFantini commented 1 year ago

Hello, I'm having trouble understanding the convention used for envmaps. In mitsuba_material.py the envmap is clamped to the range [0, 1000]; during OLAT training the OLAT envmaps are generated with the white pixel set to 100; and in train_joint.py the envmap is clamped to the range [0, 100]. Finally, in the relight.py example, test.exr is used for relighting, but that file is not provided.

I am trying to reproduce the results shown in the paper on the NeRFactor dataset, but it provides HDR envmaps (and 16-bit PNGs as training data, not .exr), so I'm not sure how to convert them, since HDR envmaps all have different value ranges. I'm also trying to use nvdiffrecmc (NVIDIA's paper) for the initial mesh+material+envmap estimation, but it outputs the envmap as an .hdr file, so again I'm not sure how to correctly convert that into a 32x16 tensor clamped to the range [0, 100] without something going wrong.

In fact, I'm currently getting very bright and strange illumination when relighting with the HDR envmaps provided in the NeRFactor dataset. Joint training, on the other hand, gives very good results with the same illumination as the training images, which makes me think something is off in how the illumination is learned; that's why I'm investigating possible envmap problems.

LinjieLyu commented 1 year ago

Hello. Have you tried normalizing the test envmap? For example, if you have a pair of GT envmap and learned envmap for the same scene, then for every novel test envmap you should scale it as test_envmap * (mean of learned_envmap / mean of gt_envmap), per RGB channel. The illumination range problem arises from the ambiguity in illumination * PRT = radiance: what we learn is the illumination scaled by an unknown factor k and the PRT scaled by 1/k. To get reasonable relighting you should first linearly map the test envmap from the real space into the learned space, and you can approximate this linear mapping by (mean of learned_envmap / mean of gt_envmap). In NeRFactor, they do a similar mapping of the estimated albedo w.r.t. the GT albedo. Since we don't have GT PRT, we choose to scale the envmap instead.
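A minimal sketch of that per-channel scaling, assuming the envmaps are loaded as H×W×3 float arrays (the function and variable names here are illustrative, not from the repository):

```python
import numpy as np

def scale_test_envmap(test_env, learned_env, gt_env):
    """Map a novel test envmap from the real space into the learned space.

    All inputs are assumed to be HDR lat-long maps with shape (H, W, 3).
    The per-channel ratio mean(learned) / mean(gt) approximates the unknown
    scale factor k between the learned and the true illumination.
    """
    ratio = learned_env.mean(axis=(0, 1)) / gt_env.mean(axis=(0, 1))  # shape (3,)
    return test_env * ratio  # broadcast over the RGB channels
```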

EliaFantini commented 1 year ago

Thanks! I will try it and let you know if it solves the problem.

EliaFantini commented 1 year ago

It doesn't work; I guess I'm making some other mistake earlier in the pipeline. Since you also tested on NeRFactor's data, as shown in the paper's supplemental document, would it be possible to have the code for that? If I understood correctly, the current code on GitHub is not compatible with those experiments. I was also wondering: if relighting needs this "mean of learned_envmap / mean of gt_envmap" ratio, what about real captures where we don't have a GT envmap? How does that work then? Also, why is the normalization not applied in relight.py when loading test.exr?

LinjieLyu commented 1 year ago

Can you post the optimized envmaps from Mitsuba2 for more clues?
I didn't add the normalization script, since it only works for synthetic scenes where one has the GT envmap (I guess I could include this feature). For real-world captures, we follow NeRFactor and simply scale by the maximum value, followed by a gamma correction.

EliaFantini commented 1 year ago

As said before, I'm trying to replace the first part (mesh+material+envmap estimation) with nvdiffrecmc from NVIDIA's paper (https://github.com/NVlabs/nvdiffrecmc); it outputs the envmap as an .hdr file (GitHub does not support .hdr uploads, so I zipped it: probe.zip), and I completely skip the Mitsuba optimization. Some stats about the envmap I obtain from nvdiffrecmc, after using torch.nn.functional.interpolate(envmap_train, [16, 32], mode='bilinear') to downscale it (original size is 512x1024), for channels R, G, B respectively: mean (0.1945, 0.1769, 0.1678), std (0.3535, 0.3594, 0.3771), max (3.67, 3.76, 3.25). When running train_joint.py, should I leave the envmap as it is with the stats I just mentioned, or should I scale/normalize it?

While trying to find the problem, I also noticed that Blender makes the OLAT renderings darker when the white pixel of the olat_envmap is closer to the top, because the pixel gets shrunk towards the pole of the emitting environment sphere and hence the emitting surface area is reduced. Is that ok? I've checked and the same thing happens with the example.blend file you uploaded, so I guess it is intended, but won't the MLP then learn a darker illumination for the envmap's pixels in the first rows?
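For reference, a small sketch of that downscaling step, assuming the probe is loaded with imageio (the file name and loader are illustrative choices on my side, not part of the repository):

```python
import imageio.v3 as iio  # reading .hdr may need imageio's freeimage/opencv plugin
import torch
import torch.nn.functional as F

# Load the HDR probe as (H, W, 3) float32, e.g. 512x1024 from nvdiffrecmc.
envmap = torch.from_numpy(iio.imread("probe.hdr")).float()

# interpolate expects (N, C, H, W), so move channels first and add a batch dim.
envmap_nchw = envmap.permute(2, 0, 1).unsqueeze(0)            # (1, 3, 512, 1024)
envmap_small = F.interpolate(envmap_nchw, size=[16, 32],
                             mode='bilinear', align_corners=False)  # (1, 3, 16, 32)
envmap_small = envmap_small.squeeze(0).permute(1, 2, 0)       # back to (16, 32, 3)

# Per-channel statistics, as quoted above.
print("mean:", envmap_small.mean(dim=(0, 1)))
print("std: ", envmap_small.std(dim=(0, 1)))
print("max: ", envmap_small.amax(dim=(0, 1)))
```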

Another question (sorry for the long list): how do you load NeRFactor's training images, since they're PNGs and the provided code only uses .exr files? Do you apply an sRGB-to-linear-RGB conversion?

LinjieLyu commented 1 year ago

I haven't played with nvdiffrecmc yet. There are several things to check, in my humble opinion:

  1. Make sure that nvdiffrecmc and Blender share the same envmap coordinate convention, e.g. the XYZ-to-UV mapping for the envmap. I know Mitsuba2 and Blender use different conventions, so I have to pass a rotation matrix to Mitsuba2.

  2. Make sure the envmap estimation is not too far off, since it serves as the initialization for joint training. We use Mitsuba2, which is global-illumination-aware and physically based, albeit with some very strict assumptions. You should check how close the estimated envmap from nvdiffrecmc is to the GT envmap. If the estimated envmap is too bad, there is no guarantee that the relighting will work. (Think of it this way: it means the learned PRT is compensating to match the training views under the wrong envmap.)

  3. "Should I leave the envmap as it is with the stats I just mentioned, or should I scale/normalize it?" In theory, you should leave it untouched, as long as the different rendering systems share the same tone mapping. For instance, if you estimated the material and envmap from LDR images that were tone-mapped beforehand, but the OLAT images rendered from Blender are HDR, you may run into problems, since your estimated illumination and the OLAT component are not in the same linear space. My suggestion: always keep the training images and the intermediate rendered OLAT images in the same space, and keep your differentiable renderer and Blender settings matched.

  4. "How do you load NeRFactor's training images, since they're PNGs and the provided code only uses .exr files?" Actually, the NeRF synthetic scenes provide the Blender files. You can render the scene from the same training camera views and envmap and output the images as .exr. Of course, you can also convert sRGB to linear RGB (see the sketch after this list), but I guess it is not trivial to restore HDR images from LDR images.

  5. In the end, remember to scale the test envmap, not the optimized envmap.
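Regarding point 4, here is a minimal sketch of the standard sRGB-to-linear conversion one could apply when loading 8- or 16-bit PNGs (this is the usual IEC 61966-2-1 piecewise formula, not code from this repository, and it does not recover HDR values that were clipped in the LDR image):

```python
import numpy as np

def srgb_to_linear(srgb):
    """Convert sRGB values in [0, 1] to linear RGB (standard piecewise formula)."""
    srgb = np.asarray(srgb, dtype=np.float32)
    return np.where(srgb <= 0.04045,
                    srgb / 12.92,
                    ((srgb + 0.055) / 1.055) ** 2.4)

# Example: load a 16-bit PNG, normalize to [0, 1], then linearize.
# img = iio.imread("train_000.png").astype(np.float32) / 65535.0
# img_linear = srgb_to_linear(img)
```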

EliaFantini commented 1 year ago

Thanks a lot, your last message helped me find the problems! There were other "smaller" issues (such as running the OLAT training on .exr OLAT images while the joint training used sRGB PNGs as training images, among others), but the main one was that the envmap learned by nvdiffrecmc was not perfect, and the learning rate of the envmap's Adam optimizer in train_joint.py was too small for the envmap to actually change, so, as you said, "it means you compensate the learned PRT to match the training view, under the wrong envmap". I fixed the other problems and increased the learning rate from 2e-4 to 2e-3, and now it works without multiplying the test envmaps by "mean of learned_envmap / mean of gt_envmap". If I do multiply the test envmaps by that ratio, the results are wrong instead. I attach some final results.

One last question: is there a reason why the default iteration counts are 150k for both the OLAT and joint trainings? The following results, obviously worse than the ones in the paper, were obtained with 30k iterations of OLAT training (batch size 1) and 16k iterations of joint training (batch size 250).

[Attached images: 2022-12-04_14-41_val_7_sunset, 2022-12-04_14-41_val_6_forest, 2022-12-04_14-41_val_6_night, 2022-12-04_14-41_val_2_city]
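For anyone hitting the same issue, a sketch of the kind of change this implies, assuming the envmap is an optimizable tensor alongside the network parameters (the variable names are illustrative, not the actual ones in train_joint.py):

```python
import torch

# Hypothetical parameters: a learnable 16x32x3 envmap and a stand-in for the PRT MLP.
envmap = torch.nn.Parameter(torch.rand(16, 32, 3))
prt_net = torch.nn.Linear(64, 3)

# Give the envmap its own Adam parameter group with a larger learning rate
# (2e-3 instead of 2e-4), so it can move away from a poor initialization.
optimizer = torch.optim.Adam([
    {"params": prt_net.parameters(), "lr": 2e-4},
    {"params": [envmap],             "lr": 2e-3},
])
```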

LinjieLyu commented 1 year ago

Nice! I think the required number of iterations differs from scene to scene. An object with a more specular material, and therefore denser view-dependent effects (e.g. reflections, specularities, color bleeding), may take longer to converge. For a rather diffuse object, the training may converge faster.