RuntimeError: CUDA error: invalid device ordinal

mansooreh1 commented 7 months ago

Hello, thanks for your nice code. When I run

    ! DETR=base python main.py --pretrained checkpoints/detr-r50-hicodet.pth \
        --output-dir outputs/pvic-detr-r50-hicodet

for training, I get the following error:

    /content/drive/MyDrive/pvic
    Namespace(backbone='resnet50', dilation=False, position_embedding='sine', hidden_dim=256, enc_layers=6, dec_layers=6, dim_feedforward=2048, dropout=0.1, nheads=8, num_queries=100, pre_norm=False, lr_head=0.0001, lr_drop=20, lr_drop_factor=0.2, epochs=30, batch_size=16, weight_decay=0.0001, clip_max_norm=0.1, aux_loss=True, set_cost_class=1, set_cost_bbox=5, set_cost_giou=2, bbox_loss_coef=5, giou_loss_coef=2, eos_coef=0.1, device='cuda', dataset='hicodet', partitions=['train2015', 'test2015'], num_workers=2, data_root='./hicodet', output_dir='outputs/pvic-detr-r50-hicodet', pretrained='checkpoints/detr-r50-hicodet.pth', print_interval=100, detector='base', raw_lambda=2.8, kv_src='C5', repr_dim=384, triplet_enc_layers=1, triplet_dec_layers=2, alpha=0.5, gamma=0.1, box_score_thresh=0.05, min_instances=3, max_instances=15, resume='', use_wandb=False, port='1234', seed=140, world_size=8, eval=False, cache=False, sanity=False)
    [W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
    [W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
    [W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
    [W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
    Traceback (most recent call last):
      File "/content/drive/MyDrive/pvic/main.py", line 193, in <module>
        mp.spawn(main, nprocs=args.world_size, args=(args,))
      File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 241, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
      File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
        while not context.join():
      File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 158, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 2 terminated with the following error:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
        fn(i, *args)
      File "/content/drive/MyDrive/pvic/main.py", line 43, in main
        torch.cuda.set_device(rank)
      File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 408, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: CUDA error: invalid device ordinal
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
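For readers hitting the same traceback: "invalid device ordinal" is raised when the index passed to `torch.cuda.set_device` does not correspond to a visible GPU, and the log above shows `world_size=8` with one process spawned per rank. The thread does not confirm how many GPUs were visible, so the following is only a sketch of a check one might run first. The helper name `usable_world_size` is hypothetical and not part of the repository.

```python
def usable_world_size(requested: int, visible_gpus: int) -> int:
    """Return a world size that does not exceed the number of visible GPUs.

    Hypothetical helper for illustration; not part of the pvic code base.
    """
    if visible_gpus < 1:
        raise RuntimeError("No CUDA device is visible to this process.")
    # Each spawned rank calls torch.cuda.set_device(rank), so every rank
    # must be a valid device index, i.e. rank < visible_gpus.
    return min(requested, visible_gpus)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot query the GPU count.")
    else:
        n_gpus = torch.cuda.device_count()
        print(f"Visible GPUs: {n_gpus}")
        if n_gpus > 0:
            # 8 is the world_size value shown in the Namespace dump above.
            print(f"Largest safe world size: {usable_world_size(8, n_gpus)}")
```

If the reported GPU count is smaller than 8, passing a matching value for the `world_size` argument may avoid the error. The exact command-line flag name is an assumption, since only the parsed `Namespace` appears in the log.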

mansooreh1 commented 7 months ago

I am running this on Google Colab.

fredzzhang commented 7 months ago

Hi @mansooreh1,

Based on the error log, it seems that port 1234 was not available on your machine. You'll need to find out which local ports are available on Google Colab.

Fred.

mansooreh1 commented 7 months ago

Hi, thanks for your reply. Does that mean your code can't be run on Google Colab?

fredzzhang commented 7 months ago

I don't know how the communication ports on Google Colab are set up. But if you find an available port, I don't see why you couldn't run it on Google Colab.
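A minimal sketch of how one might look for an available local port, using only the Python standard library. The function names `port_is_free` and `find_free_port` are hypothetical, and binding to `127.0.0.1` is an assumption about how the training script resolves `localhost`.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use" or "cannot assign address".
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the operating system for an unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 tells the OS to pick any free ephemeral port.
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print("Port 1234 free:", port_is_free(1234))
    print("A free port:", find_free_port())
```

The port returned by `find_free_port` could then be supplied in place of the default `1234`. The `Namespace` dump shows a `port` argument, but the flag name used to set it is not shown in this thread. Note that a port reported as free can still be taken by another process before the training script binds it.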

fredzzhang / pvic

RuntimeError: CUDA error: invalid device ordinal #47