Open white-black66 opened 2 years ago
Hi, @white-black66. It may be obvious, but this error means that a CUDA call to the GPU took too long (on the order of a few seconds) and was terminated. What GPU are you using?
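One thing that might be worth checking alongside the GPU model: launch timeouts of this kind are commonly tied to the driver's watchdog on GPUs that also drive a display. Whether that applies here is an assumption, since the thread does not say. Below is a minimal sketch that asks `nvidia-smi` for each GPU's display state; the function names are hypothetical, and only the `index`, `name`, `display_mode` and `display_active` query fields come from `nvidia-smi --help-query-gpu`.

```python
import csv
import io
import subprocess

# Query fields listed by `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = ["index", "name", "display_mode", "display_active"]


def parse_gpu_display_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = [field.strip() for field in record]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpu_display_state():
    """Run nvidia-smi (assumed to be on PATH) and return the parsed rows."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_gpu_display_csv(out)


def gpus_driving_a_display(rows):
    """Return the indices of GPUs that report an active display."""
    return [row["index"] for row in rows if row["display_active"] == "Enabled"]
```

Calling `gpus_driving_a_display(query_gpu_display_state())` on the evaluation machine would list the GPUs with an active display. If a GPU used for evaluation shows up there, running on the other GPUs is one thing to try.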
The author uses a V100, and I was able to evaluate it on a TITAN Xp (albeit very slowly).
Hello author, thank you for your work. I would like to ask a question about CUDA. When I evaluate your provided multi-frame model with 14 reference frames using r50_eval_multi.sh, the run fails near the end of the evaluation with RuntimeError: CUDA error: the launch timed out and was terminated. I have 4 GPUs.

The command:
GPUS_PER_NODE=4 ./tools/run_dist_launch.sh $1 eval_r50 $2 configs/r50_eval_multi.sh

The logs:
Test: [42690/44032] eta: 0:26:59 class_error: 0.00 loss: 1.0426 (1.1629) loss_bbox: 0.3274 (0.2952) loss_ce: 0.3063 (0.5222) loss_giou: 0.3216 (0.3455) cardinality_error_unscaled: 299.0000 (298.3894) class_error_unscaled: 0.0000 (18.9314) loss_bbox_unscaled: 0.0655 (0.0590) loss_ce_unscaled: 0.1531 (0.2611) loss_giou_unscaled: 0.1608 (0.1728) time: 1.2177 data: 0.0269 max mem: 2606
Traceback (most recent call last):
File "main.py", line 331, in <module>
main(args)
File "main.py", line 280, in main
data_loader_val, base_ds, device, args.output_dir)
File "/home/wmt/anaconda3/envs/Trans/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/media/wmt/Data/exp/TransVOD/engine_multi.py", line 104, in evaluate
loss_dict = criterion(outputs, targets)
File "/home/wmt/anaconda3/envs/Trans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/wmt/Data/exp/TransVOD/models/deformable_detr_multi.py", line 355, in forward
num_boxes = torch.clamp(num_boxes / get_world_size(), min=1).item()
RuntimeError: CUDA error: the launch timed out and was terminated
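A note on reading this traceback: CUDA errors are reported asynchronously, so the `.item()` call shown above may only be the point where the host synchronised with the GPU, not the kernel that actually timed out. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable makes launches synchronous, so the error is raised closer to the failing call. The sketch below shows one way to relaunch a command with that variable set. The helper names are hypothetical, and this only helps locate the slow kernel; it does not remove the timeout.

```python
import os
import subprocess


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # documented CUDA debugging variable
    return env


def run_eval_blocking(command):
    """Run `command` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1.

    Returns the process exit code; the caller decides how to handle failure.
    """
    completed = subprocess.run(command, env=blocking_cuda_env(), check=False)
    return completed.returncode
```

The launch command from the post can be passed to `run_eval_blocking` as a list of arguments. Expect the evaluation to run more slowly in this mode, so it is best used only while debugging.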