My distributed training run fails with exit code -9 and the error messages below. Any input?
Thanks,
--Ruida
Detected CUDA_VISIBLE_DEVICES=0,1,2,3: setting --include=localhost:0,1,2,3
[2023-12-23 08:29:28,147] [INFO] [runner.py:570:main] cmd = /gpfs/gsfs12/users/me/conda/envs/lisa/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=24999 --enable_each_rank_log=None train_ds.py --version=./pretrained/LLaVA/LLaVA-Lightning-7B-delta-v1-1 --dataset_dir=./dataset --vision_pretrained=./pretrained/SAM/sam_vit_h_4b8939.pth --dataset=sem_seg||refer_seg||vqa||reason_seg --sample_rates=9,3,3,1 --exp_name=lisa-7b
[2023-12-23 08:29:31,691] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 08:29:35,580] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-12-23 08:29:35,580] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-12-23 08:29:35,580] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-12-23 08:29:35,580] [INFO] [launch.py:163:main] dist_world_size=4
[2023-12-23 08:29:35,580] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-12-23 08:29:49,095] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 08:29:49,097] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 08:29:49,105] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 08:29:49,106] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
config.json: 100%|██████████| 4.52k/4.52k [00:00<00:00, 4.41MB/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
[2023-12-23 08:30:40,666] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3519203
[2023-12-23 08:30:41,879] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3519204
[2023-12-23 08:30:41,880] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3519205
[2023-12-23 08:30:41,887] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3519206
[2023-12-23 08:30:42,787] [ERROR] [launch.py:321:sigkill_handler] ['/gpfs/gsfs12/users/me/conda/envs/lisa/bin/python', '-u', 'train_ds.py', '--local_rank=3', '--version=./pretrained/LLaVA/LLaVA-Lightning-7B-delta-v1-1', '--dataset_dir=./dataset', '--vision_pretrained=./pretrained/SAM/sam_vit_h_4b8939.pth', '--dataset=sem_seg||refer_seg||vqa||reason_seg', '--sample_rates=9,3,3,1', '--exp_name=lisa-7b'] exits with return code = -9
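(Note on the -9: a negative return code from a Python-based launcher like DeepSpeed means the worker process was killed by that signal number, so -9 is SIGKILL — the process did not crash on its own, something killed it from outside. On Linux the most common sender during model loading is the kernel OOM killer. A quick sketch of the convention, using plain shell where a signal death shows up as status 128 + 9:)

```shell
# A process killed by SIGKILL exits with status 128 + 9 = 137 at the shell level;
# Python-based launchers such as DeepSpeed report the same event as returncode -9.
sh -c 'kill -9 $$'
echo "exit status: $?"   # 137
```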
Hi,
Here is my Slurm batch file. I allocate 4 A100 GPUs and 64 GB of RAM.
#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --mem=64g
#SBATCH --job-name="lisa"
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=BEGIN,END,ALL

ml CUDA/11.8
ml cuDNN/8.0.3

source myconda
conda activate lisa

cd "/data/me/LLM/LISA" \
  && deepspeed --master_port=24999 train_ds.py \
       --version="./pretrained/LLaVA/LLaVA-Lightning-7B-delta-v1-1" \
       --dataset_dir='./dataset' \
       --vision_pretrained="./pretrained/SAM/sam_vit_h_4b8939.pth" \
       --dataset="sem_seg||refer_seg||vqa||reason_seg" \
       --sample_rates="9,3,3,1" \
       --exp_name="lisa-7b"
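Since the job dies with SIGKILL while the four ranks are all loading checkpoint shards into host memory, the 64 GB `--mem` request is a likely culprit: each rank materializes the 7B model on CPU before moving it to its GPU. A hedged sketch of the usual first steps — the 200g figure is a guess to try, not a measured requirement, and `<jobid>` is a placeholder for the actual Slurm job ID:

```shell
#SBATCH --mem=200g   # raise the host-RAM request; 4 ranks each stage checkpoint shards in CPU memory

# After a failed run, ask Slurm whether the job was out-of-memory killed:
# sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS
# A State of OUT_OF_MEMORY (or MaxRSS near the --mem limit) confirms the OOM theory.
```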