NVlabs / FB-BEV

Official PyTorch implementation of FB-BEV & FB-OCC - Forward-backward view transformation for vision-centric autonomous driving perception

Error running evaluation #9

Open jarvishou829 opened 1 year ago

jarvishou829 commented 1 year ago

warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[                                                  ] 4/6019, 0.4 task/s, elapsed: 9s, ETA: 14005s/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6084/6019, 21.4 task/s, elapsed: 284s, ETA:    -2sWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795432 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795433 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795434 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 795435) of binary: /opt/conda/envs/fbocc/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/fbocc/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/fbocc/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 317, in run_module
    run_module_as_main(options.target, alter_argv=True)
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 238, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools_mm/test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-30_01:11:31
  host      : fd-taxi-f-houjiawei-1691719963348-985c0990-3609275187
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 795435)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 795435
============================================================

The evaluation ran over 6084 samples rather than the expected 6019 and then exited unexpectedly.
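(Two observations on the log, offered as assumptions rather than a confirmed diagnosis: exit code -9 means the process received SIGKILL, which on Linux is typically the kernel OOM killer reacting to exhausted host RAM; and a progress counter exceeding the dataset size is usually a cosmetic side effect of how distributed evaluation pads the dataset across ranks, not the cause of the crash. A minimal sketch of the padding effect, assuming a `torch.utils.data.DistributedSampler`-style split; the exact figure of 6084 also depends on the batch size, so it is not reproduced here.)

# Minimal sketch (assumption: the test loader shards the dataset the way
# DistributedSampler does, padding each rank to equal length). It only
# illustrates why a progress counter can exceed len(dataset).
import math

num_samples = 6019   # nuScenes validation frames shown by the progress bar
world_size = 4       # four ranks, per the four worker PIDs in the log

per_rank = math.ceil(num_samples / world_size)   # each rank gets a padded shard
padded_total = per_rank * world_size             # 1505 * 4 = 6020 > 6019

print(per_rank, padded_total)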

jarvishou829 commented 1 year ago

It seems to be a host out-of-memory issue: memory usage grows abnormally right before the evaluation finishes, and then the process is killed.
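If the spike only appears at the very end, one plausible cause (an assumption, not confirmed against the FB-BEV code) is the result-collection step of distributed testing, where every rank's predictions are gathered before metrics are computed. A simplified, self-contained sketch of that pattern, not the repo's actual code:

# Illustration of why RAM can spike at the end of multi-GPU testing.
# Mimics the usual "gather all results" pattern (e.g. what collect_results_*
# helpers in MMCV-style test loops do); assumes the default process group
# is already initialized.
import torch.distributed as dist

def gather_results(local_results):
    """Collect every rank's result list for metric computation.

    During inference each rank holds only its own shard, but all_gather_object
    pickles each shard and materialises every rank's results in every process,
    multiplying the memory footprint -- this is the point where usage jumps.
    """
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_results)
    if dist.get_rank() == 0:
        # Flatten to a single ordered list on rank 0 for evaluation.
        return [item for shard in gathered for item in shard]
    return None

Occupancy predictions are dense voxel grids, so holding all 6019 of them in memory at once can plausibly exceed host RAM; dumping per-rank results to disk or evaluating on a single GPU are the usual lower-memory workarounds.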

lubinBoooos commented 11 months ago

You can try running the evaluation on a single GPU; with one process there is no cross-rank result gathering, so only a single copy of the predictions is held in host memory.

tanatomoe commented 8 months ago

Hi, I get the same error. I've read that changing the batch size can fix it, but unfortunately I can't figure out how to do that. Perhaps you could try it, and if it works, tell me how? It would be greatly appreciated.
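For what it's worth, FB-BEV/FB-OCC configs follow the MMDetection3D style, so the evaluation batch size is normally set by `samples_per_gpu` in the `data` dict of the config file passed to `tools_mm/test.py`. The exact keys in this repo's configs are an assumption here, so treat this as a sketch of where to look rather than a verified fix; note that if the crash comes from accumulating results at the end, a smaller batch size may not help.

# Sketch of an MMDetection3D-style config fragment (keys assumed, not
# verified against the FB-OCC configs).
data = dict(
    samples_per_gpu=1,   # per-GPU batch size used for evaluation
    workers_per_gpu=2,   # dataloader worker processes also consume host RAM
)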