Rendering-at-ZJU / weight-sharing-kernel-prediction-denoising

Source code for "Real-time Monte Carlo Denoising with Weight Sharing Kernel Prediction Network" (EGSR 2021)
MIT License

The train loss is NaN #4

Open jelleopard opened 2 months ago

jelleopard commented 2 months ago

Thanks to the authors for open-sourcing the code. I ran into the following problem while trying to reproduce the results and would appreciate your help. Thank you~

The environment used is:

Ubuntu, CUDA 11.8
python         3.9
torch          2.3.1+cu118
openexr        3.2.4
scikit-image   0.19.3
scipy          1.13.1
tensorboard    2.17.0

I use the following code to generate the corresponding depth maps:

scene_names = ["sponza", "classroom", "living-room", "san-miguel", "sponza-glossy", "sponza-moving-light"]
scene_name = scene_names[5]
camera_matrices = np.zeros((60, 4, 4))
with open(os.path.join("dataset", "cameras", scene_name+".h"), "r") as file:
    camera_idx = 0
    row_idx = 0
    for line in file.readlines():
        floats = re.findall(r'-?\d+.\d+', line)
        if len(floats) > 0:
            camera_matrices[camera_idx, row_idx] = np.array(floats, dtype=np.float32)
            row_idx += 1
        if row_idx == 4:
            camera_idx += 1
            row_idx = 0

world_positions = []
for i in tqdm(range(60)):
    world_positions.append(pyexr.read(os.path.join("dataset", scene_name, "inputs", "world_position"+str(i)+".exr")))
H, W = world_positions[0].shape[:2]

depth_buffers = []
world_position = np.ones(4)
# for i in tqdm(range(60)):
for i in range(60):
    depth_buffer = np.zeros((H, W, 1))
    for h_i in range(H):
        for w_i in range(W):
            world_position[:3] = world_positions[i][h_i, w_i]
            depth_buffer[h_i, w_i] = np.dot(world_position, camera_matrices[i][:, 2]) / np.dot(world_position, camera_matrices[i][:, 3])

    print(i, depth_buffer.max(), depth_buffer.min())
    depth_buffer = (depth_buffer - depth_buffer.min()) / (depth_buffer.max() - depth_buffer.min())
    depth_buffers.append(np.concatenate((depth_buffer, depth_buffer, depth_buffer), axis=2))
    pyexr.write(os.path.join("dataset", scene_name, "depth", "depth"+str(i)+".exr"), depth_buffers[i])
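
One thing worth noting: the normalization above divides by (max - min), so any Inf/NaN in the world-position buffers (e.g. pixels where no geometry was hit) or a degenerate depth range propagates straight into the written EXR. A quick sanity check over the generated depth maps (a minimal sketch, reusing pyexr, scene_name, and the layout above):

import os
import numpy as np
import pyexr

# Scan the generated depth EXRs for NaN/Inf values and degenerate ranges.
for i in range(60):
    img = pyexr.read(os.path.join("dataset", scene_name, "depth", "depth" + str(i) + ".exr"))
    n_nan = int(np.isnan(img).sum())
    n_inf = int(np.isinf(img).sum())
    if n_nan or n_inf or img.max() == img.min():
        print("frame %d: %d NaN, %d Inf, range [%g, %g]" % (i, n_nan, n_inf, img.min(), img.max()))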

The acc_colors are then generated with data_preprocess.py.

The data format is as follows:

dataset
    cameras
    classroom
        acc_colors
        depth
        inputs
        depth_normalized
        radiance_accum
    living-room
    san-miguel
    sponza
    sponza-glossy
    sponza-moving-light

The dataset.py file is modified as follows:

        scene_names = ["classroom", "living-room", "san-miguel", "sponza-glossy", "sponza"]
        # scene_names = ["classroom-example"]

        img_num_per_scene = 60
        # img_num_per_scene = 5

Running train.py results in the following:

100%|██████████| 300/300 [01:39<00:00,  3.01it/s]
nan
nan
nan
nan
nan
...
nan
nan
nan
nan
nan
Validation:
mean SSIM: nan
mean PSNR: nan

What could be the problem? I need your help, thanks.

hmfann commented 2 months ago

Please manually check whether the dataset contains NaN values; lowering the learning rate might also help.
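
For example, something along these lines can flag bad frames before training (a minimal sketch, assuming dataloader yields (inputs, targets) pairs as in train.py):

import torch

# One pass over the training data to find batches that already contain
# NaN/Inf before they ever reach the network.
for batch_idx, (inputs, targets) in enumerate(dataloader):
    if torch.isnan(inputs).any() or torch.isinf(inputs).any():
        print("batch %d: NaN/Inf in inputs" % batch_idx)
    if torch.isnan(targets).any() or torch.isinf(targets).any():
        print("batch %d: NaN/Inf in targets" % batch_idx)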

jelleopard commented 2 months ago

Please manually check whether the dataset contains NaN values; lowering the learning rate might also help.

Thanks for your reply. Lowering the learning rate (1e-3 --> 1e-6) as you suggested helps for a while, but as the number of iterations increases the loss becomes NaN again. I found that NaN values appear in the inputs, which then cause NaN in the outputs. Could there be something wrong with the generated depth maps?


def train(model, device, dataloader, optimizer, epoch, writer):
    model.train()
    losses = []
    criterion = SMAPELoss().to(device)
    for (inputs, targets) in dataloader:
        optimizer.zero_grad(set_to_none=True)
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        if np.isnan(loss.item()):
            print(torch.isnan(inputs).any())  # True if the inputs contain any NaN

        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    writer.add_scalar("Loss/total_train", np.mean(losses), epoch)
    print("Loss: %f" % np.mean(losses))