IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

Custom MaskDINO training crashes with a RuntimeError: Global alloc not supported yet #161

Closed alrightkami closed 1 year ago

alrightkami commented 1 year ago

When I run: cd /home/jovyan/data/kamila/detrex && python tools/train_net.py --config-file projects/maskdino/configs/maskdino_r50_coco_instance_seg_50ep.py

I get the following exception:

[12/06 07:56:38 d2.engine.train_loop]: Starting training from iteration 0
/opt/conda/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2156.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
ERROR [12/06 07:56:48 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "tools/train_net_graffiti.py", line 95, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/maskdino.py", line 162, in forward
    losses = self.criterion(outputs, targets,mask_dict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/criterion.py", line 388, in forward
    indices = self.matcher(aux_outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 223, in forward
    return self.memory_efficient_forward(outputs, targets, cost)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 165, in memory_efficient_forward
    cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Global alloc not supported yet

[12/06 07:56:48 d2.engine.hooks]: Overall training speed: 3 iterations in 0:00:04 (1.3375 s / it)
[12/06 07:56:48 d2.engine.hooks]: Total training time: 0:00:04 (0:00:00 on hooks)
[12/06 07:56:49 d2.utils.events]:  eta: 4 days, 17:05:31  iter: 5  total_loss: 109.9  loss_ce: 4.103  loss_mask: 1.045  loss_dice: 1.17  loss_bbox: 0.2272  loss_giou: 0.1185  loss_ce_dn: 0.3122  loss_mask_dn: 1.603  loss_dice_dn: 1.205  loss_bbox_dn: 0.5139  loss_giou_dn: 0.328  loss_ce_0: 3.23  loss_mask_0: 1.752  loss_dice_0: 1.341  loss_bbox_0: 0.132  loss_giou_0: 0.1527  loss_ce_dn_0: 0.3276  loss_mask_dn_0: 2.752  loss_dice_dn_0: 3.534  loss_bbox_dn_0: 1.171  loss_giou_dn_0: 0.8217  loss_ce_1: 3.531  loss_mask_1: 1.334  loss_dice_1: 0.9349  loss_bbox_1: 0.09554  loss_giou_1: 0.111  loss_ce_dn_1: 0.2215  loss_mask_dn_1: 1.734  loss_dice_dn_1: 1.629  loss_bbox_dn_1: 0.795  loss_giou_dn_1: 0.4993  loss_ce_2: 3.148  loss_mask_2: 1.184  loss_dice_2: 1.401  loss_bbox_2: 0.1707  loss_giou_2: 0.1778  loss_ce_dn_2: 0.3696  loss_mask_dn_2: 1.644  loss_dice_dn_2: 1.608  loss_bbox_dn_2: 0.6091  loss_giou_dn_2: 0.4159  loss_ce_3: 3.638  loss_mask_3: 1.113  loss_dice_3: 1.233  loss_bbox_3: 0.2145  loss_giou_3: 0.1722  loss_ce_dn_3: 0.3632  loss_mask_dn_3: 1.652  loss_dice_dn_3: 1.442  loss_bbox_dn_3: 0.5413  loss_giou_dn_3: 0.3912  loss_ce_4: 3.436  loss_mask_4: 1.092  loss_dice_4: 1.122  loss_bbox_4: 0.2232  loss_giou_4: 0.1488  loss_ce_dn_4: 0.297  loss_mask_dn_4: 1.637  loss_dice_dn_4: 1.272  loss_bbox_dn_4: 0.5192  loss_giou_dn_4: 0.3572  loss_ce_5: 3.891  loss_mask_5: 1.315  loss_dice_5: 1.075  loss_bbox_5: 0.2148  loss_giou_5: 0.1458  loss_ce_dn_5: 0.2682  loss_mask_dn_5: 1.616  loss_dice_dn_5: 1.213  loss_bbox_dn_5: 0.5183  loss_giou_dn_5: 0.3461  loss_ce_6: 3.985  loss_mask_6: 1.168  loss_dice_6: 1.15  loss_bbox_6: 0.245  loss_giou_6: 0.1321  loss_ce_dn_6: 0.2713  loss_mask_dn_6: 1.57  loss_dice_dn_6: 1.197  loss_bbox_dn_6: 0.514  loss_giou_dn_6: 0.3357  loss_ce_7: 4.099  loss_mask_7: 1.093  loss_dice_7: 1.2  loss_bbox_7: 0.2284  loss_giou_7: 0.1184  loss_ce_dn_7: 0.3286  loss_mask_dn_7: 1.586  loss_dice_dn_7: 1.212  loss_bbox_dn_7: 0.5173  loss_giou_dn_7: 0.3303  loss_ce_8: 4.038  loss_mask_8: 1.049  loss_dice_8: 1.206  loss_bbox_8: 0.2284  loss_giou_8: 0.1187  loss_ce_dn_8: 0.3104  loss_mask_dn_8: 1.605  loss_dice_dn_8: 1.197  loss_bbox_dn_8: 0.51  loss_giou_dn_8: 0.3282  loss_ce_interm: 3.285  loss_mask_interm: 1.477  loss_dice_interm: 1.158  loss_bbox_interm: 0.6809  loss_giou_interm: 0.4867  time: 1.1044  data_time: 0.1023  lr: 0.0001  max_mem: 19044M
Traceback (most recent call last):
  File "tools/train_net_graffiti.py", line 232, in <module>
    launch(
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_net_graffiti.py", line 227, in main
    do_train(args, cfg)
  File "tools/train_net_graffiti.py", line 211, in do_train
    trainer.train(start_iter, cfg.train.max_iter)
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "tools/train_net_graffiti.py", line 95, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/maskdino.py", line 162, in forward
    losses = self.criterion(outputs, targets,mask_dict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/criterion.py", line 388, in forward
    indices = self.matcher(aux_outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 223, in forward
    return self.memory_efficient_forward(outputs, targets, cost)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/data/kamila/detrex/projects/maskdino/modeling/matcher.py", line 165, in memory_efficient_forward
    cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Global alloc not supported yet

So far I have figured out what may be causing it. The workaround of using batch_dice_loss instead of batch_dice_loss_jit, as discussed in the issue, fixes the crash; however, training becomes noticeably slower.
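For context, the dice cost in matcher.py is produced by a torch.jit.script-compiled helper, and it is this scripted graph that raises the "Global alloc not supported yet" error. A minimal sketch of the Mask2Former-style pair of functions I believe is involved (the exact code in MaskDINO's matcher.py may differ slightly):

```python
import torch


def batch_dice_loss(inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Pairwise dice cost between N predicted and M target masks, both given
    # as flattened (num_masks, num_points) tensors.
    inputs = inputs.sigmoid()
    numerator = 2 * torch.einsum("nc,mc->nm", inputs, targets)
    denominator = inputs.sum(-1)[:, None] + targets.sum(-1)[None, :]
    return 1 - (numerator + 1) / (denominator + 1)


# The matcher normally calls the scripted version; the workaround is to call
# the eager function above instead.
batch_dice_loss_jit = torch.jit.script(batch_dice_loss)
```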

Would really appreciate you looking at it.

HaoZhang534 commented 1 year ago

Hi @alrightkami, I saw your pull request. It seems you use "batch_dice_loss" instead of "batch_dice_loss_jit" when there are no annotations. Does the bug only appear when there are no annotations?
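If I read the diff correctly, the change amounts to a guard along these lines inside memory_efficient_forward (just a sketch, with variable names taken from the traceback above; I may be misreading the PR):

```python
# Sketch of what I understand the PR to do: use the eager dice loss when
# there are no ground-truth masks, and the scripted one otherwise.
if tgt_mask.shape[0] == 0:
    cost_dice = batch_dice_loss(out_mask, tgt_mask)
else:
    cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)
```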

alrightkami commented 1 year ago

hey there @HaoZhang534, in my custom dataset I do indeed have images with no annotations (i.e. hard negatives). However, in the config file I have dataloader.train.dataset.filter_empty = True. I'm not sure whether the filtering happens before or after this step.
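If I understand the config correctly, that flag is just the filter_empty argument of detectron2's get_detection_dataset_dicts; a minimal sketch of the call it maps to (the dataset name is a placeholder for my custom one):

```python
from detectron2.data import get_detection_dataset_dicts

# filter_empty=True is meant to drop images without annotations when the
# dataset dicts are built; "my_custom_train" is a placeholder dataset name.
dataset_dicts = get_detection_dataset_dicts(
    names="my_custom_train",
    filter_empty=True,
)
```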

alrightkami commented 1 year ago

Also, I started a training run yesterday around 5 pm with train.max_iter = 36875 and it initially said eta: 6:23:50, but after more than 12 hours it is still in progress and now says eta: 1:34:07. I always train on a fairly powerful GPU machine and it has never taken this long. So apparently batch_dice_loss makes training much slower.

FengLi-ust commented 1 year ago

Have you solved your problem? I have merged your PR.

alrightkami commented 1 year ago

@FengLi-ust it does solve it: the training no longer crashes and is still in progress. But as mentioned above, training becomes enormously slower.

FengLi-ust commented 1 year ago

You can refer to this issue to see if it solves your problem.

alrightkami commented 1 year ago

It does seem to solve it; however, I'm not sure about the training speed yet. I created a PR for the fix.

rentainhe commented 1 year ago

I'm closing this issue~ feel free to reopen it if needed~