InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

How to train multiple tasks on different GPUs at the same time? #726

Open ztfmars opened 3 months ago

ztfmars commented 3 months ago

I have 2 x A100 GPUs. I have been training one task on GPU 1, and I want to train another task on GPU 2 at the same time, but I get the following error:


CUDA_VISIBLE_DEVICES=1 \
xtuner train /home/fusionai/project/internllm_demo/llama3/llama3-ft/configs/ztf_llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_stf_1k_repeated_4k_codeflow_test.py \
--work-dir /home/fusionai/project/internllm_demo/llama3/llama3-ft/train/llava_train_test_multigpu \
--deepspeed deepspeed_zero2

[2024-05-28 15:52:17,776] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-05-28 15:52:23,113] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
05/28 15:52:24 - mmengine - INFO -
...
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Load pretrained weight from /home/fusionai/project/internllm_demo/llama3/pretrained-model/llama3-llava-iter_2181.pth
[2024-05-28 15:53:24,697] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
[2024-05-28 15:53:24,697] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-28 15:53:24,697] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2024-05-28 15:53:24,714] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=10.212.68.84, master_port=29500
[2024-05-28 15:53:24,714] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/home/fusionai/project/internllm/xtuner/xtuner/tools/train.py", line 360, in <module>
    main()
  File "/home/fusionai/project/internllm/xtuner/xtuner/tools/train.py", line 356, in main
    runner.train()
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train
    self.strategy.prepare(
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 386, in prepare
    self.model = self._wrap_model(model)
  File "/home/fusionai/project/internllm/xtuner/xtuner/engine/_strategy/deepspeed.py", line 25, in _wrap_model
    wrapper = super()._wrap_model(model)
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 401, in _wrap_model
    engine, self.optim_wrapper.optimizer, *_ = deepspeed.initialize(
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/__init__.py", line 143, in initialize
    dist.init_distributed(dist_backend=dist_backend,
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 112, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 142, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
...

How can I solve this problem? Looking forward to your response, thanks!

mylesgoose commented 1 month ago

Can you specify a different server socket to use in your run command for the second script?
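
A minimal sketch of that approach, assuming DeepSpeed skips its MPI discovery (and with it the default port 29500) when all of the standard torch.distributed environment variables RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT are already set; the config paths below are placeholders for your own configs:

# first job keeps the default rendezvous port 29500 on GPU 0
CUDA_VISIBLE_DEVICES=0 xtuner train config_a.py --deepspeed deepspeed_zero2

# second job: set all five env vars so DeepSpeed uses them instead of MPI
# discovery (assumption based on the mpi_discovery log above), and pick a
# free MASTER_PORT so the TCPStore does not collide with the first job's 29500
CUDA_VISIBLE_DEVICES=1 \
RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \
MASTER_ADDR=127.0.0.1 MASTER_PORT=29501 \
xtuner train config_b.py --deepspeed deepspeed_zero2

The key point is that the "Address already in use" error comes from both single-GPU runs trying to bind the same rendezvous port; any mechanism that gives the second run a distinct MASTER_PORT should avoid the collision.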

HeegonJin commented 3 weeks ago

I am having the same issue