Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

ERROR:torch.distributed.elastic.multiprocessing.api:failed #56

Open stwrd opened 1 year ago

stwrd commented 1 year ago

I encountered a problem while trying to test the demo of single_turn_mm.py. Could you please help me figure out what the issue is? Thank you.

My graphics card is an RTX 4090.

The system is running in a Docker container.

Linux b36678aa408c 5.19.0-43-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon May 22 13:39:36 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

CUDA VERSION

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

The error log is as follows:

root@b36678aa408c:~/share/workspace/AIGC/LLaMA2-Accessory/accessory# torchrun --nproc-per-node=1  demos/single_turn_mm.py --llama_config ../pretrained_weights/config/13B_params.json --tokenizer_path ../pretrained_weights/config/tokenizer.model --pretrained_path ../pretrained_weights/finetune/mm/alpacaLlava_llamaQformerv2_13b --quant
| distributed init (rank 0): env://, gpu 0
[10:31:50.989621] > initializing model parallel with size 1
[10:31:50.989646] > initializing ddp with size 1
[10:31:50.989650] > initializing pipeline with size 1
[10:31:51.001027] Model Args:
 ModelArgs(dim=5120, n_layers=40, n_heads=40, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=2048, rope_scaling=None)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 2267) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
demos/single_turn_mm.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-25_10:32:33
  host      : b36678aa408c
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 2267)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 2267
=====================================================
linziyi96 commented 1 year ago

How much CPU RAM do you currently have? Loading the 13B model takes about 60GB of host memory with the current implementation.
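An exit code of -9 means the process received SIGKILL, which in a setup like this almost always comes from the kernel's out-of-memory killer rather than from the script itself. A minimal way to check what the container actually sees (whether dmesg is readable inside an unprivileged container will vary):

free -h                                                          # CPU RAM and swap visible to the container
dmesg -T | grep -i -E 'out of memory|oom-kill|killed process'    # confirm the OOM killer sent the SIGKILL (may need host access)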

stwrd commented 1 year ago

Yeah, I got it. My system only has about 50GB of available memory right now.

linziyi96 commented 1 year ago

That is actually pretty close. As a workaround, you may try adding some swap space. The CPU RAM is only needed for preprocessing: once the model is fully loaded and quantized, it is moved entirely to the GPU and most of the CPU memory is freed.
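A minimal sketch of adding swap on the host (the 32G size and the /swapfile path are just examples; adjust them to your disk, and expect loading to be noticeably slower while swapping):

sudo fallocate -l 32G /swapfile      # reserve space for the swap file
sudo chmod 600 /swapfile             # restrict permissions as required by swapon
sudo mkswap /swapfile                # format the file as swap
sudo swapon /swapfile                # enable it immediately
swapon --show && free -h             # verify the new swap is active

If the container itself is memory-limited by Docker (e.g. via --memory), the OOM kill can come from the cgroup rather than the host, so that limit may also need to be raised.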

We are also aware that the peak CPU memory usage is a bit excessive at this moment and are actively working on a solution. Stay tuned for an update if you are interested!
