CUDA Runtime Error: invalid argument after setting cuda device

icedoom888 commented 4 years ago

The following code fails when setting pyredner device:

import pyredner as pyredner
import torch
import matplotlib.pyplot as plt
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"

class Neural_Renderer(torch.nn.Module):
    def __init__(self):
        super(Neural_Renderer, self).__init__()

        self.mesh_path = "/mnt/soarin/data/church_tower/mesh/church_tower.obj"
        self.resolution = [3000, 4000]

    def forward(self):
        objects = pyredner.load_obj(self.mesh_path, return_objects=True)
        object = objects[0]

        camera = pyredner.automatic_camera_placement([object], resolution = self.resolution)
        camera.position += 2* (camera.look_at - camera.position)
        scene = pyredner.Scene(camera = camera, objects = [object])
        img = pyredner.render_albedo(scene)
        print(img.device)

        return img

print('Number of Cuda visible devices: ', torch.cuda.device_count())
device_0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device_1 = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
print('Device 0 properties: ', torch.cuda.get_device_properties(device_0))
print('Device 1 properties: ', torch.cuda.get_device_properties(device_1))

pyredner.set_device(device_1)
renderer = Neural_Renderer()
renderer.to(device_1)

im = renderer().cpu()
plt.imshow(im)
plt.show()

With the following trace:

Number of Cuda visible devices:  2
Device 0 properties:  _CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11178MB, multi_processor_count=28)
Device 1 properties:  _CudaDeviceProperties(name='GeForce GTX TITAN X', major=5, minor=2, total_memory=12209MB, multi_processor_count=24)
Scene construction, time: 0.10353 s
CUDA Runtime Error: invalid argument at /tmp/pip-req-build-ax7g8eqj/buffer.h:55

On the other hand, if I set CUDA_VISIBLE_DEVICES=1, the exact same code works:

import pyredner as pyredner
import torch
import matplotlib.pyplot as plt
import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"

class Neural_Renderer(torch.nn.Module):
    def __init__(self):
        super(Neural_Renderer, self).__init__()

        self.mesh_path = "/mnt/soarin/data/church_tower/mesh/church_tower.obj"
        self.resolution = [3000, 4000]

    def forward(self):
        objects = pyredner.load_obj(self.mesh_path, return_objects=True)
        object = objects[0]

        camera = pyredner.automatic_camera_placement([object], resolution = self.resolution)
        camera.position += 2* (camera.look_at - camera.position)
        scene = pyredner.Scene(camera = camera, objects = [object])
        img = pyredner.render_albedo(scene)
        print(img.device)

        return img

print('Number of Cuda visible devices: ', torch.cuda.device_count())
device_0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Device 0 properties: ', torch.cuda.get_device_properties(device_0))

pyredner.set_device(device_0)
renderer = Neural_Renderer()
renderer.to(device_0)

im = renderer().cpu()
plt.imshow(im)
plt.show()

With the following trace:

Number of Cuda visible devices:  2
Device 0 properties:  _CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11178MB, multi_processor_count=28)
Scene construction, time: 0.03885 s
Forward pass, time: 1.51381 s
cuda:0

Please also note that calling the standard PyTorch .to(device) method has no effect whatsoever on redner's internal state.

BachiLi commented 4 years ago

I'll take a look this weekend. Thanks!

BachiLi commented 4 years ago

Hmm, it works fine on my machine and I couldn't reproduce the error:

Number of Cuda visible devices:  4
Device 0 properties:  _CudaDeviceProperties(name='TITAN Xp', major=6, minor=1, total_memory=12195MB, multi_processor_count=30)
Device 1 properties:  _CudaDeviceProperties(name='TITAN Xp', major=6, minor=1, total_memory=12196MB, multi_processor_count=30)
Scene construction, time: 4.04201 s
Forward pass, time: 1.61628 s
cuda:1

Could it be due to your two GPUs having different compute capabilities? Can you somehow test it?

icedoom888 commented 4 years ago

Hello @BachiLi, thank you for testing this.

I ran some additional tests by running the following code:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import pyredner as pyredner
import torch
import matplotlib.pyplot as plt

class Neural_Renderer(torch.nn.Module):
    def __init__(self, device):
        super(Neural_Renderer, self).__init__()
        pyredner.set_device(device)
        self.mesh_path = "/mnt/soarin/data/church_tower/mesh/church_tower.obj"
        self.resolution = [3000, 4000]

    def forward(self):
        objects = pyredner.load_obj(self.mesh_path, return_objects=True)
        object = objects[0]
        camera = pyredner.automatic_camera_placement([object], resolution = self.resolution)
        camera.position += 2* (camera.look_at - camera.position)
        scene = pyredner.Scene(camera = camera, objects = [object])
        img = pyredner.render_albedo(scene)
        print(img.device)

        return img

print('Number of Cuda visible devices: ', torch.cuda.device_count())
device_0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Device 0 properties: ', torch.cuda.get_device_properties(device_0))

renderer = Neural_Renderer(device_0)
im = renderer().cpu()
plt.imshow(im)
plt.show()

With CUDA_VISIBLE_DEVICES=0, the code runs on my GTX 1080 Ti, which has 11178 MiB of memory, and produces the following trace:

Number of Cuda visible devices:  1
Device 0 properties:  _CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11178MB, multi_processor_count=28)
Scene construction, time: 0.03863 s
Forward pass, time: 1.51581 s
cuda:0

With CUDA_VISIBLE_DEVICES=1, the code runs on my GTX TITAN X, which has 12209 MiB of memory, and produces the following trace:

Number of Cuda visible devices:  1
Device 0 properties:  _CudaDeviceProperties(name='GeForce GTX TITAN X', major=5, minor=2, total_memory=12209MB, multi_processor_count=24)
Scene construction, time: 0.03840 s
CUDA Runtime Error: out of memory at /tmp/pip-req-build-pa0ny0bz/buffer.h:55

Nvidia-smi shows:


Mon Jan 27 10:41:25 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:01:00.0  On |                  N/A |
| 28%   60C    P8    20W / 250W |  10710MiB / 12209MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 27%   32C    P8     7W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
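
For reference, the free memory per GPU can be read off the table above. The sketch below parses the Memory-Usage cells copied from that table; the function name free_memory_mib is made up for illustration and is not part of any tool.

```python
import re

# Rows copied from the nvidia-smi output above.
NVIDIA_SMI_ROWS = """\
|   0  GeForce GTX TIT...  Off  | 00000000:01:00.0  On |                  N/A |
| 28%   60C    P8    20W / 250W |  10710MiB / 12209MiB |      1%      Default |
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 27%   32C    P8     7W / 250W |      2MiB / 11178MiB |      0%      Default |
"""


def free_memory_mib(smi_text):
    """Return one (used, total, free) tuple in MiB per 'used / total' cell."""
    pattern = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
    result = []
    for used, total in pattern.findall(smi_text):
        used, total = int(used), int(total)
        result.append((used, total, total - used))
    return result


for index, (used, total, free) in enumerate(free_memory_mib(NVIDIA_SMI_ROWS)):
    print(f"nvidia-smi GPU {index}: {used} MiB used, {free} MiB free of {total} MiB")
```

On these numbers the TITAN X had about 1499 MiB free at the time of the snapshot, against 11176 MiB on the 1080 Ti.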



When running on the TITAN X with a reduced image resolution, the error does NOT occur.

It is odd that the TITAN X runs out of memory, given that it has more computation capability and no active processes running on it initially.

Do you have any suggestions?
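
One thing worth ruling out, given the traces above: PyTorch reports the 1080 Ti as device 0, while nvidia-smi lists the TITAN X as GPU 0. By default CUDA orders devices fastest first, whereas nvidia-smi uses PCI bus order, so the two numberings can differ. A minimal sketch that makes them agree, assuming it runs before anything initializes CUDA (the helper name pin_gpu_by_bus_order is hypothetical):

```python
import os


def pin_gpu_by_bus_order(nvidia_smi_index):
    """Number GPUs the way nvidia-smi does and expose only one of them.

    Call this before the first CUDA call; the safest place is before
    importing torch or pyredner.
    """
    # PCI_BUS_ID makes CUDA enumerate devices in PCI bus order
    # (the documented default is FASTEST_FIRST).
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only the selected GPU stays visible, and it becomes cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(nvidia_smi_index)


# Example: select the GPU that nvidia-smi lists as GPU 1.
pin_gpu_by_bus_order(1)
print(os.environ["CUDA_DEVICE_ORDER"], os.environ["CUDA_VISIBLE_DEVICES"])
```

With the ordering pinned this way, the index passed in CUDA_VISIBLE_DEVICES should refer to the same card that nvidia-smi shows under that index, which makes the memory figures easier to match to a device.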

BachiLi commented 4 years ago

This also has to do with how PyTorch manages memory (e.g., https://discuss.pytorch.org/t/unable-to-allocate-cuda-memory-when-there-is-enough-of-cached-memory/33296). How much memory does the rendering take on the 1080 Ti? (By the way, compute capability means something quite different: https://stackoverflow.com/questions/11973174/what-does-compute-capability-mean-w-r-t-cuda/11974822)
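
As a rough lower bound for that question, the size of the output image alone can be worked out from the resolution in the script. This sketch assumes a 3-channel float32 image, which is an assumption about the render_albedo output, and it says nothing about redner's internal buffers, which are not known here.

```python
def image_buffer_mib(height, width, channels=3, bytes_per_value=4):
    """Size in MiB of one dense image buffer.

    channels=3 and bytes_per_value=4 (float32) are assumptions,
    not values confirmed by redner.
    """
    return height * width * channels * bytes_per_value / 2**20


# Resolution taken from the script above: [3000, 4000].
size = image_buffer_mib(3000, 4000)
print(f"one 3000x4000 float32 RGB buffer: {size:.2f} MiB")
```

A single buffer of this kind is on the order of 137 MiB, so the actual peak usage during rendering would have to be measured on the device, for example by watching nvidia-smi while the forward pass runs.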

tetterl commented 4 years ago

I might have an issue related to this one. I get an out-of-memory error with my (different) code: CUDA Runtime Error: out of memory at /tmp/pip-req-build-dnhy8cqu/src/buffer.h:55. It occurs when running on a GeForce GTX TITAN X (compute capability 5.2) but not when running on a Titan X (compute capability 6.1). Unfortunately, I can't provide a reproducible code snippet. Could redner-gpu behave differently for compute capability >= 6?

BachiLi commented 4 years ago

Possible. Unified memory works differently on pre-Pascal GPUs than on Pascal and later ones (https://devblogs.nvidia.com/unified-memory-cuda-beginners/). This is what I was suspecting. I don't have a good solution to this yet.
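
To make the distinction concrete: Pascal corresponds to compute capability 6.x, and per the NVIDIA post linked above, GPUs from that generation on can page-fault and migrate managed memory on demand, while earlier GPUs cannot. A small sketch that classifies the devices reported in this thread by their major version (the helper name is_pre_pascal is made up for illustration):

```python
PASCAL_MAJOR = 6  # Pascal GPUs report compute capability 6.x


def is_pre_pascal(major):
    """True if the compute capability major version predates Pascal."""
    return major < PASCAL_MAJOR


# (name, major, minor) copied from the device properties in the traces above.
DEVICES = [
    ("GeForce GTX 1080 Ti", 6, 1),
    ("GeForce GTX TITAN X", 5, 2),
    ("TITAN Xp", 6, 1),
]

for name, major, minor in DEVICES:
    kind = "pre-Pascal" if is_pre_pascal(major) else "Pascal or later"
    print(f"{name} (compute capability {major}.{minor}): {kind}")
```

In a live session the same major and minor values can be read from torch.cuda.get_device_properties(device), as the scripts in this thread already do. Only the GTX TITAN X falls on the pre-Pascal side, which matches the device where the errors were reported.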

BachiLi / redner

CUDA Runtime Error: invalid argument after setting cuda device #97