ArduinoHocam opened this issue 2 years ago
You could try changing the CUDA version (and, by extension, the cuDNN and NCCL versions) in our Dockerfile so that it is compatible with your machine. Other than that, these CUDA errors are very hard to debug from my end, unfortunately.
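To confirm the mismatch before rebuilding, a quick check inside the container (a sketch; `torch.cuda.get_arch_list()` is available in recent PyTorch releases) compares the GPU's compute capability against the architectures torch was compiled for:

```bash
# If the device's capability (e.g. (8, 6) / sm_86 for an RTX 30xx card) is
# missing from the compiled arch list, torch has no kernels for that GPU,
# which produces exactly the "no kernel image" error below.
python3 -c "import torch; \
print('device capability:', torch.cuda.get_device_capability(0)); \
print('compiled arches:', torch.cuda.get_arch_list())"
```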
I faced the same issue, and for me the fix was to install the proper versions of torch and torchvision:

```bash
pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
```
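After installing the cu111 wheels, a quick sanity check (a sketch) verifies that the build matches the GPU and that a CUDA op actually runs:

```bash
# Expect 1.9.1+cu111 / 11.1; the tensor op fails if kernels are still
# missing for this GPU.
python3 -c "import torch; \
print(torch.__version__, torch.version.cuda); \
print(torch.ones(2, 2, device='cuda').sum().item())"
```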
@AnushaManila Yeah, but what is the point of using the Docker image, then?
Yes, I see; mine was a quick fix. You could edit the Dockerfile to install the proper versions of torch and torchvision and test it, as sketched below. Perhaps it'll work.
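A sketch of that workflow: swap the torch install step in the Dockerfile for the cu111 wheels above, then rebuild and re-run. Note that `docker-build` is an assumed target here; only `docker-run` appears in this thread, so check the repo's Makefile for the actual build target.

```bash
# In the Dockerfile, point the torch install at the cu111 wheels, e.g.
#   RUN pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 \
#       torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
# then rebuild the image and re-try the overfit run:
make docker-build   # assumed target; check the repo's Makefile
make docker-run COMMAND="python3 scripts/train.py configs/overfit_kitti.yaml"
```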
Hi, I've used your Docker image. After the installation and a successful docker build, I tried to run the following.

Download a tiny subset of KITTI:

```bash
curl -s https://tri-ml-public.s3.amazonaws.com/github/packnet-sfm/datasets/KITTI_tiny.tar | tar xv -C /data/datasets/
```

In docker:

```bash
make docker-run COMMAND="python3 scripts/train.py configs/overfit_kitti.yaml"
```

However, I get the following error:

```
########################################################################################################################
Config: configs.default_config -> configs.overfit_kitti.yaml
Name: default_config-overfit_kitti-2021.11.25-11h46m13s
########################################################################################################################
0%|          | 0/5004 [00:00<?, ? images/s]
Traceback (most recent call last):
File "scripts/train.py", line 64, in <module>
train(args.file)
File "scripts/train.py", line 59, in train
trainer.fit(model_wrapper)
File "/workspace/packnet-sfm/packnet_sfm/trainers/horovod_trainer.py", line 63, in fit
self.train(train_dataloader, module, optimizer)
File "/workspace/packnet-sfm/packnet_sfm/trainers/horovod_trainer.py", line 90, in train
output = module.training_step(batch, i)
File "/workspace/packnet-sfm/packnet_sfm/models/model_wrapper.py", line 186, in training_step
output = self.model(batch, progress=self.progress)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/workspace/packnet-sfm/packnet_sfm/models/SelfSupModel.py", line 83, in forward
output = super().forward(batch, return_logs=return_logs)
File "/workspace/packnet-sfm/packnet_sfm/models/SfmModel.py", line 117, in forward
depth_output = self.compute_depth_net(batch, force_flip=force_flip)
File "/workspace/packnet-sfm/packnet_sfm/models/SfmModel.py", line 85, in compute_depth_net
output = self.depth_net_flipping(batch, flag_flip_lr)
File "/workspace/packnet-sfm/packnet_sfm/models/SfmModel.py", line 78, in depth_net_flipping
output = self.depth_net(batch_input)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/workspace/packnet-sfm/packnet_sfm/networks/depth/DepthResNet.py", line 43, in forward
x = self.encoder(rgb)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/workspace/packnet-sfm/packnet_sfm/networks/layers/resnet/resnet_encoder.py", line 88, in forward
x = (input_image - 0.45) / 0.225
RuntimeError: CUDA error: no kernel image is available for execution on the device
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 737, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Makefile:77: recipe for target 'docker-run' failed
make: *** [docker-run] Error 1
```
My nvidia-smi output, from `sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi`:

```
Thu Nov 25 13:15:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    On   | 00000000:01:00.0  On |                  N/A |
|  0%   41C    P8    22W / 270W |    206MiB /  7973MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

Any idea? Thanks for the help!