Checkpoint issue and bad render quality after loading the checkpoints....

ZhenhuiL1n commented 11 months ago

Hi, I found a weird problem in the checkpoint module:

RuntimeError: Error(s) in loading state_dict for StrivecCP_hier:
    size mismatch for density_line.0: copying a param with shape torch.Size([9, 32, 121]) from checkpoint, the shape in current model is torch.Size([9, 32, 122]).
    size mismatch for density_line.1: copying a param with shape torch.Size([9, 32, 121]) from checkpoint, the shape in current model is torch.Size([9, 32, 122]).
    size mismatch for density_line.2: copying a param with shape torch.Size([9, 32, 121]) from checkpoint, the shape in current model is torch.Size([9, 32, 122]).
    size mismatch for density_line.3: copying a param with shape torch.Size([44, 24, 61]) from checkpoint, the shape in current model is torch.Size([44, 24, 62]).
    size mismatch for density_line.4: copying a param with shape torch.Size([44, 24, 61]) from checkpoint, the shape in current model is torch.Size([44, 24, 62]).
    size mismatch for density_line.5: copying a param with shape torch.Size([44, 24, 61]) from checkpoint, the shape in current model is torch.Size([44, 24, 62]).
    size mismatch for app_line.0: copying a param with shape torch.Size([9, 48, 121]) from checkpoint, the shape in current model is torch.Size([9, 48, 122]).
    size mismatch for app_line.1: copying a param with shape torch.Size([9, 48, 121]) from checkpoint, the shape in current model is torch.Size([9, 48, 122]).
    size mismatch for app_line.2: copying a param with shape torch.Size([9, 48, 121]) from checkpoint, the shape in current model is torch.Size([9, 48, 122]).
    size mismatch for app_line.3: copying a param with shape torch.Size([44, 48, 61]) from checkpoint, the shape in current model is torch.Size([44, 48, 62]).
    size mismatch for app_line.4: copying a param with shape torch.Size([44, 48, 61]) from checkpoint, the shape in current model is torch.Size([44, 48, 62]).
    size mismatch for app_line.5: copying a param with shape torch.Size([44, 48, 61]) from checkpoint, the shape in current model is torch.Size([44, 48, 62]).

The size mismatch occurs when loading a saved checkpoint, even though the config file was not modified between training and loading. Loading does work if I change local_dims_file from [121, 121, 121, 61, 61, 61, 31, 31, 31] to [120, 120, 120, 60, 60, 60, 31, 31, 31], but that only avoids the problem rather than solving it. Also, the rendered image quality is not as good as that of the novel views rendered during the training process. I think these dimension mismatches may be the cause.
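For anyone hitting a similar error, a quick way to see which parameters differ is to compare the shapes stored in the checkpoint against the shapes of the freshly built model. The sketch below is only an illustration: `find_shape_mismatches` is a hypothetical helper, not part of Strivec, and it works on plain dictionaries of shapes so that it runs without PyTorch. With PyTorch, such dictionaries could be built from a state dict with something like `{k: tuple(v.shape) for k, v in state_dict.items()}`. The `unchanged.weight` entry is made up for the example; the other shapes are taken from the error above.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    ckpt_shapes: Dict[str, Shape], model_shapes: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for every shared key whose shapes differ."""
    mismatches = []
    # Only keys present on both sides can have a size mismatch.
    for name in sorted(ckpt_shapes.keys() & model_shapes.keys()):
        if ckpt_shapes[name] != model_shapes[name]:
            mismatches.append((name, ckpt_shapes[name], model_shapes[name]))
    return mismatches


# Shapes copied from the error message, plus one hypothetical matching entry.
ckpt_shapes = {
    "density_line.0": (9, 32, 121),
    "density_line.3": (44, 24, 61),
    "unchanged.weight": (8, 8),  # hypothetical parameter, for illustration only
}
model_shapes = {
    "density_line.0": (9, 32, 122),
    "density_line.3": (44, 24, 62),
    "unchanged.weight": (8, 8),
}

mismatches = find_shape_mismatches(ckpt_shapes, model_shapes)
for name, saved, current in mismatches:
    print(f"{name}: checkpoint {saved} vs model {current}")
```

In this example every mismatch is off by exactly one in the last dimension, which points at how that dimension is computed rather than at the saved weights themselves.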

This is the image output (render only) using the checkpoint after 30000 iterations: 020

This is the image output at test time inside the training process (after only 4k iterations): 003999_020

Zerg-Overmind commented 11 months ago

Hi Zhenhui, thank you very much for sharing this. This is indeed a bug that I just discovered. It is a numerical issue, and you can now pull the latest train_hier.py to your local device to solve it.

The error is here: L154. Applying torch.floor to a value such as torch.floor([61.000]) somehow yields torch.tensor([60]). I solved it by simply adding a small number, 0.01. Please let me know if there are any other issues!
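One plausible explanation, stated here as an assumption rather than a confirmed diagnosis of L154, is that the value was stored slightly below 61 and only looked like 61.000 because of print rounding. The sketch below reproduces that effect with the standard library's `math.floor` instead of `torch.floor`, so it runs without PyTorch; the value `61.0 - 1e-9` is a hypothetical stand-in for whatever the code computed.

```python
import math

# Hypothetical value: prints as 61.000 with three decimals,
# but is actually stored slightly below 61.
dim = 61.0 - 1e-9

shown = f"{dim:.3f}"                       # what a rounded printout displays
floored = math.floor(dim)                  # plain floor drops to the integer below
floored_with_eps = math.floor(dim + 0.01)  # the small-offset fix described above
rounded = round(dim)                       # alternative: round to the nearest integer

print(shown, floored, floored_with_eps, rounded)
```

The small offset has a trade-off: any value within 0.01 below an integer is pushed up to it, so a genuine 60.995 would also become 61. Rounding to the nearest integer is another option when the quantity is expected to be a whole number.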

ZhenhuiL1n commented 10 months ago

Hi, the layer size mismatch is solved. However, the point clouds surrounding the person are still there when I load the trained ckpt and render the novel views. When I render the novel views in the testing process before saving the checkpoint, the point clouds do not appear, so I think there may be some compression when saving the ckpt. This also happened with the original NeRF synthetic dataset when I reloaded to render novel views.

ZhenhuiL1n commented 10 months ago

Btw, thanks a lot for responding so fast and maintaining the repo. It is a wonderful project!

Zerg-Overmind commented 10 months ago

Hi, what do you mean by "point cloud"? Is that the noise shown in the figure you posted here? Actually, I never tried loading the checkpoint to do evaluation :) because (as you might have noticed) evaluation simply follows the training process, and the "render_only" option is one I "borrowed" directly from TensoRF without any modification :( . Ideally, there shouldn't be any difference between the two, but I haven't thoroughly gone through the details of this part. So for now I highly recommend doing training and evaluation in one shot instead of running the evaluation part on its own after training.

Thanks again for running and checking the code. I will definitely double-check our code once I am available.

ZhenhuiL1n commented 10 months ago

Yes, I also noticed that. You can check the first and second images: the first one is rendered using the ckpt and the second one is rendered without the ckpt (in the test process right after training). In the first one, the person is surrounded by some weird cloud that didn't appear when I rendered after training (without loading the ckpt). Maybe some compression is applied or some pre-processing is missing. I will also try to fix it in the coming days. Thanks for the reply~

Zerg-Overmind / Strivec

Checkpoint issue and bad render quality after loading the checkpoints.... #2