OpenDriveLab / UniAD

[CVPR 2023 Best Paper Award] Planning-oriented Autonomous Driving
Apache License 2.0

Dataloader worker killed with runtime error. #62

Open vgudapati opened 1 year ago

vgudapati commented 1 year ago

Hello,

While training the stage-two network, I'm seeing the following error.

Is anyone seeing the same error?

Traceback (most recent call last):
  File "./tools/train.py", line 256, in <module>
    main()
  File "./tools/train.py", line 245, in main
    custom_train_model(
  File "/home/ubuntu/torc/git/personal/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
    custom_train_detector(
  File "/home/ubuntu/torc/git/personal/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 194, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/torc/git/personal/UniAD/projects/mmdet3d_plugin/uniad/detectors/uniad_e2e.py", line 81, in forward
    return self.forward_train(**kwargs)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/ubuntu/torc/git/personal/UniAD/projects/mmdet3d_plugin/uniad/detectors/uniad_e2e.py", line 163, in forward_train
    losses_track, outs_track = self.forward_track_train(img, gt_bboxes_3d, gt_labels_3d, gt_past_traj, gt_past_traj_mask, gt_inds, gt_sdc_bbox, gt_sdc_label,
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/ubuntu/torc/git/personal/UniAD/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py", line 555, in forward_track_train
    frame_res = self._forward_single_frame_train(
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/ubuntu/torc/git/personal/UniAD/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py", line 385, in _forward_single_frame_train
    bev_embed, bev_pos = self.get_bevs(
  File "/home/ubuntu/torc/git/personal/UniAD/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py", line 342, in get_bevs
    img_feats = self.extract_img_feat(img=imgs)
  File "/home/ubuntu/torc/git/personal/UniAD/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py", line 162, in extract_img_feat
    img_feats = self.img_backbone(img)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/mmdet/models/backbones/resnet.py", line 638, in forward
    x = self.maxpool(x)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/nn/modules/pooling.py", line 162, in forward
    return F.max_pool2d(input, self.kernel_size, self.stride,
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/_jit_internal.py", line 405, in fn
    return if_false(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/nn/functional.py", line 718, in _max_pool2d
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1103017) is killed by signal: Killed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1099909) of binary: /home/ubuntu/.conda/envs/uniad/bin/python
/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 1099909 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train

  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
        ./tools/train.py FAILED
=======================================
Root Cause:
[0]:
  time: 2023-07-07_12:12:31
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 1099909)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
***************************************

Thanks for your attention. I'm training this on an AWS EC2 instance (g5-12x) with 4 A10 GPUs!

Regards,
Venkat

daxiongpro commented 11 months ago

Hello, I have the same problem. Have you solved it?

(uniad) ➜  UniAD git:(dev) ./tools/uniad_dist_train.sh ./projects/configs/stage1_track_map/base_track_map.py 1
projects.mmdet3d_plugin
Traceback (most recent call last):
  File "./tools/train.py", line 256, in <module>
    main()
  File "./tools/train.py", line 173, in main
    cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
  File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/mmcv/utils/config.py", line 541, in dump
    f.write(self.pretty_text)
  File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/mmcv/utils/config.py", line 496, in pretty_text
    text, _ = FormatCode(text, style_config=yapf_style, verify=True)
TypeError: FormatCode() got an unexpected keyword argument 'verify'
/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59905) of binary: /usr/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
  File "/usr/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
        ./tools/train.py FAILED        
=======================================
Root Cause:
[0]:
  time: 2023-10-27_14:46:08
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 59905)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************
xiexu666 commented 8 months ago

Have you solved this?

vgudapati commented 8 months ago

Hello,

I did solve this problem. May I ask when you are hitting this issue?

If I remember correctly, I was hitting this issue during the validation check, and I needed to set the following environment variable, which fixed it.

NCCL_P2P_DISABLE=1
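
For anyone unsure how to apply this, here is a minimal sketch of passing the variable to a launch from Python. NCCL_P2P_DISABLE is a standard NCCL environment variable that turns off peer-to-peer transport between GPUs. The helper name build_launch and its arguments are hypothetical, and the script path is the one that appears elsewhere in this thread. Whether this setting helps in a given setup is not guaranteed.

```python
import os


def build_launch(config, num_gpus, base_env=None):
    """Build the command and environment for a training launch.

    Hypothetical helper: the script path is taken from this thread and
    may differ in your checkout.
    """
    # Copy the environment so the parent process is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Disable NCCL peer-to-peer transport, as suggested in the comment above.
    env["NCCL_P2P_DISABLE"] = "1"
    cmd = ["./tools/uniad_dist_train.sh", config, str(num_gpus)]
    return cmd, env


# To actually launch (not executed here):
#   import subprocess
#   cmd, env = build_launch("path/to/config.py", 4)
#   subprocess.run(cmd, env=env, check=True)
```

From a shell, the equivalent is to prefix the launch command with NCCL_P2P_DISABLE=1.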

Thanks,
Venkat

LinuxCup commented 3 months ago

@xiexu666 @daxiongpro Hello, run the following commands to resolve this problem:

$ pip uninstall yapf
$ pip install yapf==0.40.1

Reference: https://blog.csdn.net/ZZZZ_Y_/article/details/133902230
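
To check whether an environment is affected before reinstalling, a small sketch like the one below may help. It inspects whether the installed yapf's FormatCode still accepts the verify keyword that mmcv passes in the traceback above. The helper names accepts_keyword and yapf_supports_verify are made up for illustration. The import path yapf.yapflib.yapf_api is, as far as I know, the one mmcv uses.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if param is not None and param.kind in keyword_kinds:
        return True
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def yapf_supports_verify():
    """Return True/False for the installed yapf, or None if yapf is missing."""
    try:
        from yapf.yapflib.yapf_api import FormatCode
    except ImportError:
        return None
    return accepts_keyword(FormatCode, "verify")
```

If yapf_supports_verify() returns False, the installed yapf is likely the cause of the TypeError, and pinning the version as described above should apply.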