fredzzhang / pvic

Official PyTorch implementation for ICCV2023 paper "Exploring Predicate Visual Context in Detecting Human-Object Interactions"
BSD 3-Clause "New" or "Revised" License

RuntimeError: CUDA error: invalid device ordinal #47

Closed mansooreh1 closed 1 month ago

mansooreh1 commented 2 months ago

Hello, thanks for your pretty code. When I run

! DETR=base python --pretrained checkpoints/detr-r50-hicodet.pth \
    --output-dir outputs/pvic-detr-r50-hicodet

for training, I get the following error:

/content/drive/MyDrive/pvic
Namespace(backbone='resnet50', dilation=False, position_embedding='sine', hidden_dim=256, enc_layers=6, dec_layers=6, dim_feedforward=2048, dropout=0.1, nheads=8, num_queries=100, pre_norm=False, lr_head=0.0001, lr_drop=20, lr_drop_factor=0.2, epochs=30, batch_size=16, weight_decay=0.0001, clip_max_norm=0.1, aux_loss=True, set_cost_class=1, set_cost_bbox=5, set_cost_giou=2, bbox_loss_coef=5, giou_loss_coef=2, eos_coef=0.1, device='cuda', dataset='hicodet', partitions=['train2015', 'test2015'], num_workers=2, data_root='./hicodet', output_dir='outputs/pvic-detr-r50-hicodet', pretrained='checkpoints/detr-r50-hicodet.pth', print_interval=100, detector='base', raw_lambda=2.8, kv_src='C5', repr_dim=384, triplet_enc_layers=1, triplet_dec_layers=2, alpha=0.5, gamma=0.1, box_score_thresh=0.05, min_instances=3, max_instances=15, resume='', use_wandb=False, port='1234', seed=140, world_size=8, eval=False, cache=False, sanity=False)
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:1234 (errno: 99 - Cannot assign requested address).

Traceback (most recent call last):
  File "/content/drive/MyDrive/pvic/", line 193, in
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/", line 158, in join
    raise ProcessRaisedException(msg, error_index,
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/", line 68, in _wrap
    fn(i, *args)
  File "/content/drive/MyDrive/pvic/", line 43, in main
    torch.cuda.set_device(rank)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
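
A note for readers of this log: the Namespace printout shows `world_size=8`, and the failing call is `torch.cuda.set_device(rank)` in process 2. PyTorch raises `invalid device ordinal` when the requested device index is not smaller than the number of visible GPUs, so it may be worth comparing the number of spawned processes against `torch.cuda.device_count()`. Below is a minimal sketch of such a check. `check_world_size` is a hypothetical helper, not part of the pvic repository, and the device count is passed in as a plain integer so the snippet runs without a GPU.

```python
def check_world_size(world_size: int, device_count: int) -> int:
    """Return a world size that does not exceed the visible GPU count.

    Hypothetical helper for illustration only; not part of the pvic code.
    """
    if device_count < 1:
        raise RuntimeError("No CUDA devices are visible to this process.")
    if world_size < 1:
        raise ValueError("world_size must be a positive integer.")
    # Each spawned process calls set_device(rank) with rank < world_size,
    # so world_size must not exceed the number of visible devices.
    return min(world_size, device_count)


# In a real script, device_count would come from torch.cuda.device_count().
print(check_world_size(8, 1))  # prints 1
```

If the machine exposes fewer GPUs than the `world_size` value shown in the printout, lowering that argument to the GPU count is the corresponding change, assuming the script exposes it on the command line.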

mansooreh1 commented 2 months ago

I am running this on Google Colab.

fredzzhang commented 2 months ago

Hi @mansooreh1,

Based on the error log, it seems that port 1234 was not available on your device. You'll need to find out which local ports are available on Google Colab.
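
One way to look for an available local port is to ask the operating system for one. The sketch below uses only Python's standard `socket` module. `find_free_port` and `port_is_free` are hypothetical helper names, and whether a port found this way resolves the problem on Google Colab is an assumption that has not been verified here.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 lets the OS choose any free port.
        s.bind((host, 0))
        return s.getsockname()[1]


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the given TCP port can currently be bound."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


print(port_is_free(1234))   # whether the default port from the log is bindable
print(find_free_port())     # a port number chosen by the OS
```

The returned number could then be supplied through the `port` argument that appears in the Namespace printout. Note that a port reported as free can be taken by another process before the training script binds it, so this is a best-effort check.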


mansooreh1 commented 2 months ago

Hi, thanks for your reply. Does that mean your code can't be run on Google Colab?

fredzzhang commented 2 months ago

I don't know how the communication ports on Google Colab are set up. But if you find an available port, I don't see why you couldn't run it on Google Colab.