NVlabs / VoxFormer

Official PyTorch implementation of VoxFormer [CVPR 2023 Highlight]

stage 1 error #55

Open hitbuyi opened 4 months ago

2024-07-10 22:44:07,575 - mmdet - INFO - Saving checkpoint at 1 epochs
[                                                  ] 0/815, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "./tools/train.py", line 261, in <module>
    main()
  File "./tools/train.py", line 250, in main
    custom_train_model(
  File "/home/hitbuyi/AD_Projects/Pytorch_Project/VoxFormer/projects/mmdet3d_plugin/voxformer/apis/train.py", line 27, in custom_train_model
    custom_train_detector(
  File "/home/hitbuyi/AD_Projects/Pytorch_Project/VoxFormer/projects/mmdet3d_plugin/voxformer/apis/mmdet_train.py", line 200, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/hitbuyi/AD_Projects/Pytorch_Project/VoxFormer/projects/mmdet3d_plugin/core/evaluation/eval_hooks.py", line 77, in _do_evaluate
    results = custom_multi_gpu_test(
  File "/home/hitbuyi/AD_Projects/Pytorch_Project/VoxFormer/projects/mmdet3d_plugin/voxformer/apis/test.py", line 81, in custom_multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hitbuyi/AD_Projects/Pytorch_Project/VoxFormer/projects/mmdet3d_plugin/voxformer/detectors/lmscnet.py", line 249, in forward
    return self.foward_test(**kwargs)
  File "/home/hitbuyi/AD_Projects/Pytorch_Project/VoxFormer/projects/mmdet3d_plugin/voxformer/detectors/lmscnet.py", line 317, in foward_test
    y_pred_bin.tofile(save_query_path)
NameError: name 'save_query_path' is not defined
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 426443) of binary: /home/hitbuyi/.conda/envs/pt110/bin/python
Traceback (most recent call last):
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hitbuyi/.conda/envs/pt110/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-10_22:44:14
  host      : hitbuyi-Dell-G15-5511
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 426443)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
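
The first traceback is the one that matters: the evaluation hook that runs after epoch 1 reaches `y_pred_bin.tofile(save_query_path)` in `foward_test`, and Python reports a plain `NameError` rather than an `UnboundLocalError`. That usually means `save_query_path` is not assigned anywhere in that function (for example, the line that builds it is missing or commented out), so the name falls through to a failed global lookup. The second traceback is only the launcher reporting that the child process exited. A minimal sketch of building the path before the write is shown below; `write_query_proposal`, `save_dir`, `frame_id`, and the `.query` extension are hypothetical names chosen for illustration and are not taken from the repository.

```python
import os
from typing import Optional


def write_query_proposal(y_pred_bin, save_dir: Optional[str], frame_id: str) -> Optional[str]:
    """Write a binary prediction to disk only when an output directory is set.

    `y_pred_bin` is any object with a `tofile(path)` method, such as a NumPy
    array. `save_dir` and `frame_id` are hypothetical parameters.
    """
    if save_dir is None:
        # No output directory configured: skip the write instead of
        # referencing a path that was never built.
        return None

    os.makedirs(save_dir, exist_ok=True)

    # Bind the name explicitly before it is used, so the write below can
    # never hit an undefined `save_query_path`.
    save_query_path = os.path.join(save_dir, frame_id + ".query")
    y_pred_bin.tofile(save_query_path)
    return save_query_path
```

Returning `None` when no directory is configured lets the per-epoch evaluation run without writing files, which may or may not match what the stage 1 config intends; checking how `lmscnet.py` is expected to derive the path from the input metadata is the more reliable fix.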