BachiLi / redner

Differentiable rendering without approximation.
https://people.csail.mit.edu/tzumao/diffrt/
MIT License

Bug in render_pathtracing? It often stops in the middle of rendering #80

Closed dkasuga closed 4 years ago

dkasuga commented 4 years ago

I "pip install"ed redner-gpu to my python environment. There's basically no problem, but render_pathtracing operation often stops in the middle, especially when I'm using the function iteratively (using pyredner as neural network module). This problem didn't appear when I used pyredner on the provided docker-image.

Does anybody have the same problem? My system is 18.04.3 LTS (Bionic Beaver) 4.15.0-60-generic, NVIDIA Driver Version: 410.129.

dkasuga commented 4 years ago

After some trial and error, I found that the mesh size (number of vertices), the number of path-tracing samples, and the batch size have a large influence on whether rendering stops or not. In my case, I tried to render 8 scenes (batch size 8), each with 7920 vertices, at 256 rendering samples for many iterations. This placed a high load on the CPUs.

So I have now tried changing parameters such as the batch size and the number of rendering samples, but the rendering still stops in the middle of training, although there was some improvement (before: at iteration 1-2; after: at iteration 15-16). Do I have to do some kind of explicit memory release?
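
For example, is something like the following what would be needed at the end of each iteration? (Just a sketch of what I have in mind, not something in my script; I'm also not sure torch.cuda.empty_cache() even applies to redner's own allocations.)

# Sketch of the kind of cleanup I have in mind (illustrative only).
import gc
import torch

def release_memory():
    gc.collect()              # drop Python objects that are no longer referenced
    torch.cuda.empty_cache()  # return PyTorch's cached CUDA blocks to the driver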

My code:


'''
This is a simple neural network, like an autoencoder: it takes an image of a sphere with a material as input and outputs the material parameters. In this code, I focus only on the diffuse_reflectance parameter.
'''
import torch
import pyredner
import numpy as np

from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torchvision.utils as vutils
from torchvision import datasets, models, transforms
from PIL import Image

import sys
from tqdm import tqdm
import argparse
import os.path
import random
import time
from torch.utils.tensorboard import SummaryWriter

from my_models import MyResnet, Model

def make_obj_and_cam(d_reflectance):
    vertices, indices, uvs, normals = pyredner.generate_sphere(theta_steps=64,
                                                               phi_steps=128)
    diffuse_reflectance = d_reflectance
    m = pyredner.Material(diffuse_reflectance=d_reflectance,
                          specular_reflectance=torch.tensor(
                              (0.2, 0.2, 0.2), device=pyredner.get_device()),
                          roughness=torch.tensor([0.001],
                                                 device=pyredner.get_device()))
    obj = pyredner.Object(vertices=vertices,
                          indices=indices,
                          uvs=uvs,
                          normals=normals,
                          material=m)
    cam = pyredner.automatic_camera_placement([obj],
                                              resolution=args.resolution)
    return obj, cam

def train(args, envmap, model, optimizer):
    for iteration in tqdm(range(args.total_iter)):
        input_scenes = []
        for batch in range(args.batch_size):
            diffuse_reflectance = torch.rand(3, device=pyredner.get_device())
            obj, cam = make_obj_and_cam(diffuse_reflectance)
            scene = pyredner.Scene(objects=[obj], camera=cam, envmap=envmap)
            input_scenes.append(scene)

        input_imgs = pyredner.render_pathtracing(scene=input_scenes,
                                                 num_samples=args.samples)
        input_imgs = torch.pow(input_imgs, 1.0 / 2.2)  # gamma correction
        input_imgs = input_imgs.permute(0, 3, 1, 2)  #NxCxHxW

        # Estimation
        output_params = model(input_imgs)

        output_scenes = []
        for batch in range(args.batch_size):
            diffuse_reflectance = output_params[batch]
            obj, cam = make_obj_and_cam(diffuse_reflectance)
            scene = pyredner.Scene(objects=[obj], camera=cam, envmap=envmap)
            output_scenes.append(scene)

        try:
            output_imgs = pyredner.render_pathtracing(scene=output_scenes,
                                                      num_samples=args.samples)
        except Exception:
            print("error")
            print("output_params:{}".format(output_params))
            raise  # re-raise: continuing here would use an undefined output_imgs

        output_imgs = torch.pow(output_imgs, 1.0 / 2.2)  # gamma correction
        output_imgs = output_imgs.permute(0, 3, 1, 2)  #NxCxHxW

        #loss = nn.MSELoss()(output_imgs, input_imgs)
        loss = (output_imgs - input_imgs).pow(2).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if iteration % 10 == 0:
            saved_imgs = torch.cat(
                [input_imgs.detach().cpu(),
                 output_imgs.detach().cpu()], dim=2)
            vutils.save_image(saved_imgs,
                              os.path.join(args.output_dir,
                                           f'logs/saved_imgs_{iteration}.jpg'),
                              nrow=(args.batch_size))
            writer.add_images('saved_imgs', saved_imgs, iteration)

            print(f"loss:{loss}")
            writer.add_scalar("loss", loss, iteration)

if __name__ == '__main__':

    from datetime import datetime
    date_time = datetime.now()
    log_folder = 'trial_%s_%d_%d_%d' % (date_time.date(), date_time.hour,
                                        date_time.minute, date_time.second)
    os.mkdir(log_folder)
    os.mkdir(log_folder + '/logs')

    parser = argparse.ArgumentParser(
        description='diffuse_reflectance learning')

    parser.add_argument('--trial_name',
                        type=str,
                        default="diffuse_learning_test1",
                        help='a brief description of the training trial')
    parser.add_argument('--output_dir',
                        default=log_folder,
                        help='output directory')
    parser.add_argument('--gpu_id',
                        type=int,
                        default=0,
                        help='0 is the first gpu, 1 is the second gpu, etc.')
    parser.add_argument('--resolution',
                        type=int,
                        default=224,
                        help='image resolution')
    parser.add_argument('--samples',
                        type=int,
                        default=64,
                        help='rendering samples')
    parser.add_argument(
        '--lr',
        type=float,
        default=0.001,
        help=
        'learning rate; the default of 1e-3 usually does not need to be changed, but you can try a larger value such as 2e-3'
    )
    parser.add_argument(
        '--batch_size',
        type=int,
        default=8,
        help='how many images to train on together in one iteration')
    parser.add_argument(
        '--total_iter',
        type=int,
        default=200,
        help=
        'how many iterations to train in total (assuming the initial step is 1)'
    )

    args = parser.parse_args()

    print(str(args))

    args.resolution = (args.resolution, args.resolution)

    trial_name = args.trial_name
    device = torch.device("cuda:%d" % (args.gpu_id))
    pyredner.set_use_gpu(torch.cuda.is_available())
    pyredner.set_device(device)

    writer = SummaryWriter(log_dir=f"{args.output_dir}/tensorboard")

    envmap_img = pyredner.imread('./grace-new.exr')
    envmap_img = envmap_img.to(device)  # .to() is not in-place
    envmap = pyredner.EnvironmentMap(envmap_img * 10.0)

    model = MyResnet(dim_out=3)
    model = model.to(device)
    model.train()

    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=args.lr,
                                 betas=(0.0, 0.99))

    train(args, envmap, model, optimizer)

BachiLi commented 4 years ago

I haven't tried your code, but I just fixed a related memory bug in 0.1.30 (https://github.com/BachiLi/redner/commit/042aff925207ef3fcc0c163d01d9dcd2c86452cb). Can you try again?

BachiLi commented 4 years ago

Another thing to look at is GPU memory usage. Maybe monitor nvidia-smi while your code is running.
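
For example, here is a minimal sketch (illustration only) that polls nvidia-smi from a separate Python process; utilization.gpu, memory.used, and memory.total are standard nvidia-smi query fields:

# Sketch: periodically print GPU utilization and memory while training runs.
import subprocess
import time

def log_gpu_stats(interval_sec=5.0):
    query = ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used,memory.total",
             "--format=csv,noheader"]
    while True:
        print(subprocess.check_output(query).decode().strip())
        time.sleep(interval_sec)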

dkasuga commented 4 years ago

Thanks for your prompt response! I updated redner-gpu and tried again, but the same problem remains... As you suggested, I monitored nvidia-smi and found that every time the process stops, the Volatile GPU-Util stays stuck at 100%. There is still headroom in memory usage, on the other hand.

[Screenshot of nvidia-smi output, 2019-12-16 17:06]

BachiLi commented 4 years ago

Can you share my_models.py with me? Let's see if I can reproduce this.

BachiLi commented 4 years ago

Some related pytorch discussions (without proper answers): https://discuss.pytorch.org/t/program-stops-with-no-error/23374 https://discuss.pytorch.org/t/pytorch-stop-without-any-warning/16421 https://stackoverflow.com/questions/59245714/pytorch-cnn-model-stop-at-loss-backward-without-any-prompt

dkasuga commented 4 years ago

Thank you very much for your kindness, and I'm sorry for bothering you. Here is my_models.py (though to be honest, I don't think MyResnet is the cause of the problem, because the process always stops in the middle of the forward rendering):

import torch
from torch import nn
import torch.nn.functional as F
from torchvision import models

class MyResnet(nn.Module):
    def __init__(self, dim_out=7):
        super(MyResnet, self).__init__()
        resnet = models.resnet18(pretrained=True)
        self.resnet = nn.Sequential(*list(resnet.children())[:-1])
        num_fc_in_features = resnet.fc.in_features
        self.fc = nn.Linear(num_fc_in_features, dim_out)

    def forward(self, x):
        x = self.resnet(x)
        x = x.squeeze()
        x = self.fc(x)
        x = torch.tanh(x)  #[-1,1]
        x = (x + 1.0) / 2.0  #[0,1]
        return x

In addition, I modified the code in the previous post a little. Please pay attention to this part of the code:

vertices, indices, uvs, normals = pyredner.generate_sphere(theta_steps=64,
                                                           phi_steps=128)

I suspect the cause of the problem is that the mesh is too large (too many vertices and triangles). In fact, when I set theta_steps=8, phi_steps=16, the problem doesn't happen (a small sketch comparing the two mesh sizes follows the bunny snippet below). However, in practice I want to deal with more complicated objects that have many vertices (ShapeNet, etc.). I also tried applying the rendering setup to the Stanford bunny, just as in your tutorial, but the rendering usually stops in the middle in that case too.

# stanford bunny version
def make_obj_and_cam(d_reflectance):
    objects = pyredner.load_obj('bunny/bunny.obj', return_objects=True)
    for obj in objects:
        obj.material = pyredner.Material(
            diffuse_reflectance=d_reflectance,
            specular_reflectance=torch.tensor((0.2, 0.2, 0.2),
                                              device=pyredner.get_device()),
            roughness=torch.tensor([0.001], device=pyredner.get_device()))
    cam = pyredner.automatic_camera_placement(objects,
                                              resolution=args.resolution)
    return objects, cam
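
For reference, here is the small sketch mentioned above (illustration only, not part of my training code) that compares how large the generated sphere mesh is for the two tessellation settings:

# Sketch: compare mesh sizes for different sphere tessellation settings.
import pyredner

for theta_steps, phi_steps in [(8, 16), (64, 128)]:
    vertices, indices, uvs, normals = pyredner.generate_sphere(
        theta_steps=theta_steps, phi_steps=phi_steps)
    print(f"theta_steps={theta_steps}, phi_steps={phi_steps}: "
          f"{vertices.shape[0]} vertices, {indices.shape[0]} triangles")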

Thank you.

dkasuga commented 4 years ago

I executed the training with `--batch_size 8 --samples 64`.

BachiLi commented 4 years ago

I'm still running your code (at iteration 13 without a crash), but I want to point out that the dim_out parameter of your MyResnet should be 3.

BachiLi commented 4 years ago

Another small tip: you want to avoid directly using torch.pow(img, 1.0/2.2), since its derivative blows up to infinity at 0 and the result is NaN for negative values. You might want something like torch.pow(img.clamp(min=0.0), 1.0/2.2).
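
A minimal sketch of that, with a small epsilon clamp added as an extra safeguard (the epsilon is an addition here, not part of the tip above):

# Sketch: gamma correction that avoids infinite/NaN gradients near zero.
import torch

def safe_gamma(img, gamma=2.2, eps=1e-6):
    # Clamping to a small positive epsilon keeps the gradient of pow finite
    # at (and below) zero; plain clamp(min=0.0) already avoids the NaNs.
    return torch.pow(img.clamp(min=eps), 1.0 / gamma)

In the training loop above, this would replace torch.pow(input_imgs, 1.0 / 2.2) and torch.pow(output_imgs, 1.0 / 2.2).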

BachiLi commented 4 years ago

Hmm, I couldn't reproduce this. I was able to run for more than 100 iterations without issue. Are you sure you didn't get any error messages?

dkasuga commented 4 years ago

I have good news! I updated redner-gpu from 0.1.31 to 0.1.32, and the problem finally disappeared. I don't understand what the cause was, but some difference between the two versions clearly matters for (at least) my environment. In fact, the problem still occurs in a conda virtual environment with redner-gpu 0.1.31.

Anyway, thank you for fixing the bug and for debugging my code! I hope my case will be useful for future updates!