I have 2 x A100 GPUs.
I have been training one task on GPU1,
and I want to train another task on GPU2 at the same time,
but I get the following error:
CUDA_VISIBLE_DEVICES=1 \
xtuner train /home/fusionai/project/internllm_demo/llama3/llama3-ft/configs/ztf_llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_stf_1k_repeated_4k_codeflow_test.py \
--work-dir /home/fusionai/project/internllm_demo/llama3/llama3-ft/train/llava_train_test_multigpu \
--deepspeed deepspeed_zero2
[2024-05-28 15:52:17,776] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-05-28 15:52:23,113] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
05/28 15:52:24 - mmengine - INFO -
...
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Load pretrained weight from /home/fusionai/project/internllm_demo/llama3/pretrained-model/llama3-llava-iter_2181.pth
[2024-05-28 15:53:24,697] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
[2024-05-28 15:53:24,697] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-28 15:53:24,697] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2024-05-28 15:53:24,714] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=10.212.68.84, master_port=29500
[2024-05-28 15:53:24,714] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "/home/fusionai/project/internllm/xtuner/xtuner/tools/train.py", line 360, in <module>
main()
File "/home/fusionai/project/internllm/xtuner/xtuner/tools/train.py", line 356, in main
runner.train()
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train
self.strategy.prepare(
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 386, in prepare
self.model = self._wrap_model(model)
File "/home/fusionai/project/internllm/xtuner/xtuner/engine/_strategy/deepspeed.py", line 25, in _wrap_model
wrapper = super()._wrap_model(model)
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 401, in _wrap_model
engine, self.optim_wrapper.optimizer, *_ = deepspeed.initialize(
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/__init__.py", line 143, in initialize
dist.init_distributed(dist_backend=dist_backend,
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 112, in __init__
self.init_process_group(backend, timeout, init_method, rank, world_size)
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 142, in init_process_group
torch.distributed.init_process_group(backend,
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/home/fusionai/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
...
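From the traceback, it looks like port 29500 (the `master_port` reported by `mpi_discovery` above) is already bound when the second job starts. My guess is that the first training job holds it, but the log does not say which process does, so that is an assumption. Below is a minimal sketch for checking whether the port is taken and for asking the OS for an unused one. The helper names `port_in_use` and `find_free_port` are hypothetical and not part of xtuner, DeepSpeed, or PyTorch, and the sketch does not show how to pass a different port to the launcher.

```python
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding to (host, port) fails, i.e. something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # errno 98 (Address already in use) ends up here on Linux.
            return True
        return False


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # 29500 is the port shown in the log above.
    print("29500 in use:", port_in_use(29500))
    print("candidate free port:", find_free_port())
```

A port returned by `find_free_port` is only free at the moment of the call; another process could claim it before the training job starts.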
How can I solve this problem? I look forward to your response.
Thanks