dvlab-research / LISA

Project Page for "LISA: Reasoning Segmentation via Large Language Model"

distributed training error #96

Open ruida opened 11 months ago

ruida commented 11 months ago

Hi,

Here is my SLURM batch file. I allocate 4 A100 cards with 64 GB of host RAM.

#!/bin/bash

#SBATCH --time=72:00:00
#SBATCH --mem=64g
#SBATCH --job-name="lisa"
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=BEGIN,END,ALL

ml CUDA/11.8
ml cuDNN/8.0.3

source myconda
conda activate lisa

cd "/data/me/LLM/LISA" \
  && deepspeed --master_port=24999 train_ds.py \
       --version="./pretrained/LLaVA/LLaVA-Lightning-7B-delta-v1-1" \
       --dataset_dir='./dataset' \
       --vision_pretrained="./pretrained/SAM/sam_vit_h_4b8939.pth" \
       --dataset="sem_seg||refer_seg||vqa||reason_seg" \
       --sample_rates="9,3,3,1" \
       --exp_name="lisa-7b"

The distributed training run fails with exit code -9 (log below). Any input? Thanks,
--Ruida

Detected CUDA_VISIBLE_DEVICES=0,1,2,3: setting --include=localhost:0,1,2,3
[2023-12-23 08:29:28,147] [INFO] [runner.py:570:main] cmd = /gpfs/gsfs12/users/me/conda/envs/lisa/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=24999 --enable_each_rank_log=None train_ds.py --version=./pretrained/LLaVA/LLaVA-Lightning-7B-delta-v1-1 --dataset_dir=./dataset --vision_pretrained=./pretrained/SAM/sam_vit_h_4b8939.pth --dataset=sem_seg||refer_seg||vqa||reason_seg --sample_rates=9,3,3,1 --exp_name=lisa-7b
[2023-12-23 08:29:31,691] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 08:29:35,580] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-12-23 08:29:35,580] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-12-23 08:29:35,580] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-12-23 08:29:35,580] [INFO] [launch.py:163:main] dist_world_size=4
[2023-12-23 08:29:35,580] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-12-23 08:29:49,095] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 08:29:49,097] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 08:29:49,105] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 08:29:49,106] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
(the tokenizer warning above is printed once by each of the 4 ranks)
config.json: 100%|██████████| 4.52k/4.52k [00:00<00:00, 4.41MB/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
(the checkpoint-loading bar above appears once per rank; all 4 ranks are still at 0/2 when the job dies)
[2023-12-23 08:30:40,666] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3519203
[2023-12-23 08:30:41,879] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3519204
[2023-12-23 08:30:41,880] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3519205
[2023-12-23 08:30:41,887] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3519206
[2023-12-23 08:30:42,787] [ERROR] [launch.py:321:sigkill_handler] ['/gpfs/gsfs12/users/me/conda/envs/lisa/bin/python', '-u', 'train_ds.py', '--local_rank=3', '--version=./pretrained/LLaVA/LLaVA-Lightning-7B-delta-v1-1', '--dataset_dir=./dataset', '--vision_pretrained=./pretrained/SAM/sam_vit_h_4b8939.pth', '--dataset=sem_seg||refer_seg||vqa||reason_seg', '--sample_rates=9,3,3,1', '--exp_name=lisa-7b'] exits with return code = -9
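A return code of -9 means the worker processes were killed with SIGKILL; since all four ranks die while still loading checkpoint shards, the usual suspect on a SLURM cluster is the node's out-of-memory killer enforcing the job's --mem limit while each rank loads its own copy of the 7B checkpoint into host RAM. A minimal sketch of how one might confirm this on a typical SLURM/Linux node (the job ID is a placeholder, not from the thread):

    # Peak host-memory use and exit state recorded by SLURM accounting
    sacct -j <your_job_id> --format=JobID,State,ExitCode,MaxRSS,ReqMem

    # Kernel log on the compute node: the OOM killer logs the processes it kills
    dmesg -T | grep -i -E "out of memory|killed process"

If MaxRSS sits at or near ReqMem, raising --mem (or switching to --mem-per-gpu as in the sketch above) is the straightforward fix.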

ruida commented 10 months ago

I just solved the issue. You can close it now. Thanks.