TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

RuntimeError: CUDA error: no kernel image is available for execution on the device #191

Open ArduinoHocam opened 2 years ago

ArduinoHocam commented 2 years ago

Hi, I've used your docker image. After the installation and a successful docker build, I tried to run the following.

Download a tiny subset of KITTI:

```
curl -s https://tri-ml-public.s3.amazonaws.com/github/packnet-sfm/datasets/KITTI_tiny.tar | tar xv -C /data/datasets/
```

In docker:

```
make docker-run COMMAND="python3 scripts/train.py configs/overfit_kitti.yaml"
```

However, I get the following error:

```
########################################################################################################################
Config: configs.default_config -> configs.overfit_kitti.yaml
Name: default_config-overfit_kitti-2021.11.25-11h46m13s
########################################################################################################################
  0%|          | 0/5004 [00:00<?, ? images/s]
Traceback (most recent call last):
  File "scripts/train.py", line 64, in <module>
    train(args.file)
  File "scripts/train.py", line 59, in train
    trainer.fit(model_wrapper)
  File "/workspace/packnet-sfm/packnet_sfm/trainers/horovod_trainer.py", line 63, in fit
    self.train(train_dataloader, module, optimizer)
  File "/workspace/packnet-sfm/packnet_sfm/trainers/horovod_trainer.py", line 90, in train
    output = module.training_step(batch, i)
  File "/workspace/packnet-sfm/packnet_sfm/models/model_wrapper.py", line 186, in training_step
    output = self.model(batch, progress=self.progress)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/packnet-sfm/packnet_sfm/models/SelfSupModel.py", line 83, in forward
    output = super().forward(batch, return_logs=return_logs)
  File "/workspace/packnet-sfm/packnet_sfm/models/SfmModel.py", line 117, in forward
    depth_output = self.compute_depth_net(batch, force_flip=force_flip)
  File "/workspace/packnet-sfm/packnet_sfm/models/SfmModel.py", line 85, in compute_depth_net
    output = self.depth_net_flipping(batch, flag_flip_lr)
  File "/workspace/packnet-sfm/packnet_sfm/models/SfmModel.py", line 78, in depth_net_flipping
    output = self.depth_net(batch_input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/packnet-sfm/packnet_sfm/networks/depth/DepthResNet.py", line 43, in forward
    x = self.encoder(rgb)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/packnet-sfm/packnet_sfm/networks/layers/resnet/resnet_encoder.py", line 88, in forward
    x = (input_image - 0.45) / 0.225
RuntimeError: CUDA error: no kernel image is available for execution on the device
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 737, in answer_challenge
    response = connection.recv_bytes(256)  # reject large message
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Makefile:77: recipe for target 'docker-run' failed
make: *** [docker-run] Error 1
```

My nvidia-smi output from `sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi`:

```
Thu Nov 25 13:15:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    On   | 00000000:01:00.0  On |                  N/A |
|  0%   41C    P8    22W / 270W |    206MiB /  7973MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

Any idea? Thanks for the help!

VitorGuizilini-TRI commented 2 years ago

You could try changing the CUDA version (and, by extension, the CUDNN and NCCL versions) in our dockerfile so that it is compatible with your machine. Other than that, these CUDA errors are very hard to debug from my end, unfortunately.
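For reference, here is a minimal sketch of how to check whether this mismatch is the cause (this script is not part of the repo; it assumes it is run with the container's Python and a PyTorch build recent enough to have `torch.cuda.get_arch_list`). An RTX 3070 is compute capability 8.6 (sm_86), which PyTorch wheels built for CUDA 10.x/11.0 generally do not include:

```python
import torch

# CUDA build the wheel was compiled against, and the kernel architectures it ships.
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("compiled architectures:", torch.cuda.get_arch_list())

# Actual device: an RTX 3070 reports capability (8, 6), i.e. sm_86.
major, minor = torch.cuda.get_device_capability(0)
print("device:", torch.cuda.get_device_name(0), f"-> sm_{major}{minor}")

# If the device's sm_XX string is absent from the compiled architecture list,
# any GPU op will raise "no kernel image is available for execution on the device".
if f"sm_{major}{minor}" not in torch.cuda.get_arch_list():
    print("Mismatch: this PyTorch build has no kernels for this GPU.")
```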

AnushaManila commented 2 years ago

I faced the same issue, and for me the fix was to install the proper versions of torch and torchvision:

```
pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
```
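After reinstalling, a quick smoke test (just a sketch; any small CUDA op would do) is to run the same normalization that crashed in resnet_encoder.py on a dummy batch. The cu111 wheels ship sm_86 kernels for Ampere cards like the RTX 3070, so this should complete without the kernel-image error:

```python
import torch

# Same arithmetic that failed at resnet_encoder.py line 88, on a dummy image batch.
input_image = torch.rand(1, 3, 192, 640, device="cuda")
x = (input_image - 0.45) / 0.225

print(x.shape, x.device)  # expected: torch.Size([1, 3, 192, 640]) cuda:0
```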

ArduinoHocam commented 2 years ago

@AnushaManila Yeah, but what is the point of using the docker image then?

AnushaManila commented 2 years ago

Yes, I see; mine was a quick fix. You could edit the Dockerfile to install the proper versions of torch and torchvision and test it. Perhaps it'll work.
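For example, the edit could look roughly like this (a sketch only, not the repository's actual Dockerfile; the base-image tag and wheel versions are assumptions that would need to be reconciled with the image's other dependencies):

```dockerfile
# Hypothetical excerpt: use a CUDA 11.x base image so the toolkit in the container
# supports Ampere GPUs (the exact tag here is an assumption, not the repo's current one).
FROM nvidia/cuda:11.1.1-devel-ubuntu18.04

# Install PyTorch wheels built against CUDA 11.1, which include sm_86 kernels for the
# RTX 3070 (same versions AnushaManila used outside the container).
RUN pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 \
        -f https://download.pytorch.org/whl/torch_stable.html
```

As noted above, the CUDNN/NCCL versions (and anything built against torch, such as horovod) would likely need matching updates as well.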