daviddmc / NeSVoR

NeSVoR is a package for GPU-accelerated slice-to-volume reconstruction.
MIT License

--device option not passed down to tiny-cuda-nn #14

Closed · jennydaman closed this issue 1 year ago

jennydaman commented 1 year ago

nesvor provides the --device option to specify which GPU nesvor should use. However, the GPU selection is not passed down to the tiny-cuda-nn functions, which always attempt to use GPU #0.

We have a machine with 2 GPUs. GPU #0 is currently in use, so we try to run nesvor --device 1. However, the following exception occurs:

Traceback (most recent call last):
  File "/opt/conda/bin/nesvor", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/nesvor/cli/main.py", line 23, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/nesvor/cli/main.py", line 48, in run
    getattr(commands, command_class)(args).main()
  File "/opt/conda/lib/python3.10/site-packages/nesvor/cli/commands.py", line 75, in main
    self.exec()
  File "/opt/conda/lib/python3.10/site-packages/nesvor/cli/commands.py", line 162, in exec
    model, output_slices, mask = train(input_dict["input_slices"], self.args)
  File "/opt/conda/lib/python3.10/site-packages/nesvor/inr/train.py", line 36, in train
    model = NeSVoR(
  File "/opt/conda/lib/python3.10/site-packages/nesvor/inr/models.py", line 309, in __init__
    self.build_network(bounding_box)
  File "/opt/conda/lib/python3.10/site-packages/nesvor/inr/models.py", line 350, in build_network
    self.inr = INR(bounding_box, self.args, self.spatial_scaling)
  File "/opt/conda/lib/python3.10/site-packages/nesvor/inr/models.py", line 147, in __init__
    self.encoding = build_encoding(
  File "/opt/conda/lib/python3.10/site-packages/nesvor/inr/models.py", line 48, in build_encoding
    raise e
  File "/opt/conda/lib/python3.10/site-packages/nesvor/inr/models.py", line 39, in build_encoding
    encoding = tcnn.Encoding(
  File "/opt/conda/lib/python3.10/site-packages/tinycudann/modules.py", line 315, in __init__
    super(Encoding, self).__init__(seed=seed)
  File "/opt/conda/lib/python3.10/site-packages/tinycudann/modules.py", line 161, in __init__
    initial_params = self.native_tcnn_module.initial_params(seed)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I used nvtop to monitor GPU usage during the runtime of NeSVoR.

(Screenshot of nvtop output, 2023-06-29 14:36:31)

daviddmc commented 1 year ago

Currently, the tinycudann modules are first allocated on the default GPU (cuda:0) and then moved to the desired device, e.g., cuda:1. This won't work if cuda:0 does not have enough free memory. So I need to change the default GPU before creating the modules.
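A minimal sketch of that fix, assuming a GPU machine with torch and tinycudann installed (the wrapper function build_encoding_on and its arguments are hypothetical names for illustration, not NeSVoR's actual API):

```python
import torch
import tinycudann as tcnn  # requires a CUDA-capable GPU

def build_encoding_on(device: torch.device, encoding_config: dict):
    # Make `device` the current CUDA device *before* tcnn allocates,
    # so its internal buffers land there instead of on cuda:0.
    torch.cuda.set_device(device)
    encoding = tcnn.Encoding(n_input_dims=3, encoding_config=encoding_config)
    return encoding.to(device)
```

The key point is that torch.cuda.set_device must run before the tcnn.Encoding constructor, since tiny-cuda-nn allocates on whatever the current device is at construction time.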

daviddmc commented 1 year ago

Another solution (which is actually the one recommended by PyTorch) is to set the environment variable CUDA_VISIBLE_DEVICES='1' and then run the algorithm with device=0.
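As a sketch, this workaround can also be applied from within Python, provided the environment variable is set before torch (and hence CUDA) initializes; the helper name select_gpu is hypothetical:

```python
import os

def select_gpu(physical_index: int) -> int:
    """Hide all GPUs except `physical_index` from CUDA.

    Must run before `import torch` (or at least before any CUDA call),
    because CUDA reads CUDA_VISIBLE_DEVICES only once, at initialization.
    The surviving GPU is then renumbered to 0, which is exactly what
    libraries that hard-code device 0, like tiny-cuda-nn, will use.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return 0  # the index to pass as --device / cuda:N afterwards

device_index = select_gpu(1)
```

With this approach, NeSVoR and tiny-cuda-nn both see a single GPU at index 0, so no device plumbing inside the library is needed.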