facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0
30.6k stars 7.5k forks

Introducing torch.compile in DDP mode causes abnormal termination with signal SIGSEGV #5325

Open proshanm opened 4 months ago

proshanm commented 4 months ago

Instructions To Reproduce the Issue:

To speed up training, I added a torch.compile call right after the DistributedDataParallel wrapper in detectron2/engine/defaults.py:

    ddp = DistributedDataParallel(model, **kwargs)
    ddp = torch.compile(ddp, mode="max-autotune")

I then trained a ViTDet model, and training terminated abnormally with the following exception message:

Traceback (most recent call last):
  File "/home/jupyterhub/proshm/detectron2/tools/lazyconfig_train_net.py", line 124, in <module>
    launch(
  File "/home/jupyterhub/proshm/detectron2/detectron2/engine/launch.py", line 69, in launch
    mp.start_processes(
  File "/home/ps/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/home/ps/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 145, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with signal SIGSEGV
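
The traceback above comes from the parent process and only says that worker 3 died with SIGSEGV. It does not show where the worker was when it crashed. A minimal sketch of one way to get more detail, assuming the standard-library `faulthandler` module is enabled at the start of each worker (for example at the top of the `main(args)` function that is passed to `launch`). The helper name `enable_crash_tracebacks` and the log file naming are hypothetical, chosen only for illustration:

```python
import faulthandler
import os


def enable_crash_tracebacks(log_dir="."):
    """Make a SIGSEGV dump the Python stack of every thread in this process.

    Hypothetical helper: each process writes to its own file so that
    output from different ranks does not interleave.
    """
    path = os.path.join(log_dir, f"fault_pid{os.getpid()}.log")
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file, path
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before launching should have a similar effect without code changes, with the dump going to stderr. Either way, this only shows the Python-level stack at the time of the crash. It does not by itself identify the native code that faulted.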

Environment:

I installed detectron2 on Ubuntu with PyTorch 2.1.
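
A fuller environment report usually makes issues like this easier to triage. A minimal sketch, assuming only the Python standard library, that prints the installed versions of the relevant packages. The function name `basic_env_report` and the package list are illustrative choices, not part of detectron2:

```python
import importlib.metadata
import platform
import sys


def basic_env_report(packages=("torch", "torchvision", "detectron2")):
    """Return a short text report of Python, OS, and package versions."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # Package metadata not found in this environment.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(basic_env_report())
```

detectron2 also ships its own collector, `python -m detectron2.utils.collect_env`, which reports considerably more (CUDA, compiler, and GPU details) and is what the issue template asks for.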

github-actions[bot] commented 4 months ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs"; "Your Environment";

Programmer-RD-AI commented 4 months ago

Hi, please check: How to fix a SIGSEGV in pytorch when using distributed training (e.g. DDP)? Process 3 terminated with signal SIGSEGV

Thank You
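
One way to narrow down which part of the compile stack is involved is to retry the same run with progressively more aggressive `torch.compile` settings, ending with the one from the report. A minimal sketch under that assumption. `COMPILE_LADDER` and `wrap_for_step` are hypothetical names, and nothing here is confirmed as the cause or the fix for this crash:

```python
# Candidate torch.compile settings, ordered from least to most aggressive.
COMPILE_LADDER = [
    {"backend": "eager"},      # TorchDynamo graph capture only, no code generation
    {"backend": "aot_eager"},  # adds AOTAutograd tracing, still no Inductor
    {"backend": "inductor"},   # default backend with its default mode
    {"backend": "inductor", "mode": "max-autotune"},  # setting from the report
]


def wrap_for_step(model, step, ddp_kwargs=None):
    """Wrap `model` in DDP, then compile it with the settings for `step`.

    Hypothetical helper: keeps the same DDP-then-compile order as the
    snippet in the report so only the compile settings vary between runs.
    """
    # Imported lazily so the ladder itself can be inspected without PyTorch.
    import torch
    from torch.nn.parallel import DistributedDataParallel

    ddp = DistributedDataParallel(model, **(ddp_kwargs or {}))
    return torch.compile(ddp, **COMPILE_LADDER[step])
```

If the crash first appears at a particular step, that points to the component added at that step and gives a much more specific report for the PyTorch or detectron2 maintainers.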