SJTU-LuHe / TransVOD

This repository contains the code for the paper "End-to-End Video Object Detection with Spatial-Temporal Transformers".
Apache License 2.0
212 stars 28 forks

RuntimeError: CUDA error: the launch timed out and was terminated #21

Open white-black66 opened 2 years ago

white-black66 commented 2 years ago

Hello author, thank you for your work. I have a question about a CUDA error. When I evaluate your provided multi-frame model with 14 reference frames using r50_eval_multi.sh, the run fails near the end of the evaluation with RuntimeError: CUDA error: the launch timed out and was terminated.

I have 4 GPUs. The command:

GPUS_PER_NODE=4 ./tools/run_dist_launch.sh $1 eval_r50 $2 configs/r50_eval_multi.sh

The logs:

Test: [42690/44032] eta: 0:26:59 class_error: 0.00 loss: 1.0426 (1.1629) loss_bbox: 0.3274 (0.2952) loss_ce: 0.3063 (0.5222) loss_giou: 0.3216 (0.3455) cardinality_error_unscaled: 299.0000 (298.3894) class_error_unscaled: 0.0000 (18.9314) loss_bbox_unscaled: 0.0655 (0.0590) loss_ce_unscaled: 0.1531 (0.2611) loss_giou_unscaled: 0.1608 (0.1728) time: 1.2177 data: 0.0269 max mem: 2606
Traceback (most recent call last):
  File "main.py", line 331, in <module>
    main(args)
  File "main.py", line 280, in main
    data_loader_val, base_ds, device, args.output_dir)
  File "/home/wmt/anaconda3/envs/Trans/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/media/wmt/Data/exp/TransVOD/engine_multi.py", line 104, in evaluate
    loss_dict = criterion(outputs, targets)
  File "/home/wmt/anaconda3/envs/Trans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/wmt/Data/exp/TransVOD/models/deformable_detr_multi.py", line 355, in forward
    num_boxes = torch.clamp(num_boxes / get_world_size(), min=1).item()
RuntimeError: CUDA error: the launch timed out and was terminated

itbergl commented 1 year ago

Hi, @white-black66. This may be obvious, but the error means a CUDA kernel launch on the GPU ran too long (on the order of a few seconds) and was terminated. What GPU are you using?

The author uses a V100, and I was able to evaluate it on a TITAN Xp (albeit very slowly).
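
For anyone hitting the same error, here is a minimal diagnostic sketch, not a confirmed fix for this issue. It rests on two general assumptions: launch timeouts commonly come from the driver's watchdog on a GPU that also drives a display, and CUDA errors are reported asynchronously, so the `.item()` call in the traceback may only be the point where the failure surfaced. The helper names `blocking_env` and `query_display_state` are hypothetical and not part of TransVOD; the second one relies on `nvidia-smi` being installed and returns None otherwise.

```python
import os
import shutil
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    With CUDA_LAUNCH_BLOCKING=1 each kernel launch is synchronous, so a
    traceback is more likely to point at the kernel that actually failed.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def query_display_state():
    """List (index, name, display_active) per GPU, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,display_active",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    rows = []
    for line in result.stdout.splitlines():
        if line.strip():
            # Each CSV row looks like "<index>, <name>, <Enabled|Disabled>"
            rows.append(tuple(field.strip() for field in line.split(",")))
    return rows


if __name__ == "__main__":
    print("GPU display state:", query_display_state())
    print("CUDA_LAUNCH_BLOCKING =", blocking_env()["CUDA_LAUNCH_BLOCKING"])
```

If one of the GPUs reports an active display, one option is to leave it out of the run with `CUDA_VISIBLE_DEVICES` and lower `GPUS_PER_NODE` to match. The environment from `blocking_env()` can be passed to `subprocess.run(..., env=...)` when rerunning the evaluation command to see where the failure really occurs, at the cost of slower execution.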