NVIDIA / MinkowskiEngine

Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
https://nvidia.github.io/MinkowskiEngine

Error when assigning TensorField to a specific cuda device #465

Open daniCh8 opened 2 years ago

daniCh8 commented 2 years ago

Describe the bug

When creating a TensorField, the object allocates memory on the first available device even when a different device id is passed to the constructor. This makes it impossible to use models stored on any device other than the first one.


Code To Reproduce

import numpy as np
import MinkowskiEngine as ME
from MinkowskiEngine import TensorField
from MinkowskiEngine.utils import sparse_quantize, sparse_collate

device = 'cuda:5'

rcoords = np.linspace(0, 100, 75000 * 6, dtype=np.float64).reshape(-1, 75000, 3)
rfeats = np.random.rand(2, 75000, 4)

c1, f1 = sparse_quantize(coordinates=rcoords[0], features=rfeats[0])
c2, f2 = sparse_quantize(coordinates=rcoords[1], features=rfeats[1])
c, f = sparse_collate(coords=[c1, c2], feats=[f1, f2])
c = c.to(device)
f = f.to(device)

tensor_field = TensorField(coordinates=c, features=f, device=device)

conv = ME.MinkowskiConvolution(
    4,
    64,
    kernel_size=3,
    stride=2,
    dimension=3,
).to(device).double()

conv(tensor_field.sparse())

Expected behavior

The tensor_field above should be created entirely on cuda:5. Calling tensor_field.device does return the device I set in the constructor (cuda:5); however, checking the GPUs' memory shows an allocation on GPU 0 triggered by the TensorField constructor (see the attached picture, a snapshot of the memory status after running the code above; 3632 is the PID of the script). Running the code above raises the following exception, which confirms that the tensor_field object is not fully stored on cuda:5 as requested:

RuntimeError: /tmp/pip-req-build-bhp9c3al/src/convolution_gpu.cu:66, assertion (at::cuda::check_device({in_feat, kernel})) failed. in_feat and kernel must be on the same device

Desktop:

==========System==========
Linux-5.4.0-107-generic-x86_64-with-glibc2.17
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
3.8.12 (default, Oct 12 2021, 13:49:34) 
[GCC 7.5.0]
==========Pytorch==========
1.10.2
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 510.47.03
CUDA Version 11.6
VBIOS Version 94.02.26.08.1C
Image Version G001.0000.03.03
GSP Firmware Version N/A
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

==========MinkowskiEngine==========
0.5.4
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11060
CUDART version MinkowskiEngine is compiled: 11060

[Attached screenshot: gpu-devices-utilization — nvidia-smi memory snapshot showing an allocation on GPU 0 by the script (PID 3632)]

TB5z035 commented 2 years ago

Same error here. Passing quantization_mode=ME.SparseTensorQuantizationMode.RANDOM_SUBSAMPLE as an argument when constructing the TensorField seems to be a workaround.
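
A minimal sketch of that workaround, reusing the c, f, and device variables from the reproduction script above (treat the assumption that this is the only change needed as unverified):

import MinkowskiEngine as ME
from MinkowskiEngine import TensorField

# Workaround sketch: select RANDOM_SUBSAMPLE as the quantization mode when
# building the TensorField; c and f are the collated coordinates and
# features, already moved to the target device.
tensor_field = TensorField(
    coordinates=c,
    features=f,
    quantization_mode=ME.SparseTensorQuantizationMode.RANDOM_SUBSAMPLE,
    device=device,
)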

daniCh8 commented 2 years ago

Thanks for sharing! My workaround so far has been to launch any script that uses TensorFields with CUDA_VISIBLE_DEVICES=<target_device_id>.
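
A minimal sketch of that approach, assuming the environment variable is set before torch or MinkowskiEngine initialize CUDA (the id 5 is just the target device from the reproduction script):

import os

# Restrict the process to the target physical GPU; it then appears as cuda:0
# inside the script, so there is no other device for allocations to leak onto.
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

import torch
import MinkowskiEngine as ME

device = "cuda:0"  # the only visible device now maps to physical GPU 5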

TB5z035 commented 2 years ago

Update: using torch.cuda.set_device(device_index) would be a better practice.

It seems that the author uses Tensor.cuda() internally instead of setting the device explicitly via Tensor.to(device), so it is necessary to specify a default target device.
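
A minimal sketch of this suggestion, again reusing c, f, and device from the reproduction script (it assumes that internal .cuda() calls then default to the selected device):

import torch
from MinkowskiEngine import TensorField

device = "cuda:5"
# Make the target GPU the default CUDA device so that any internal
# Tensor.cuda() calls inside MinkowskiEngine land on it.
torch.cuda.set_device(device)

tensor_field = TensorField(coordinates=c, features=f, device=device)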