It seems that the CPU RAM has run out. You may consider using a machine with more CPU memory or lowering the number of GPUs.
The machine has 125G of memory in total, with around 110G available according to `free -h`. Isn't that enough? May I know the memory requirement for LISA?
Hi, a single process actually does not consume that much. But in your case you train with 8 GPUs, so the CPU memory consumption is roughly 8 times as much, since each rank loads its own copy of the checkpoint into CPU RAM.
FYI, on my side 256G of CPU memory is enough for training on 8x RTX 3090 24G GPUs.
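If the total has to fit into roughly 110G, the per-process footprint during checkpoint loading can also be reduced. Below is a minimal sketch, not the actual train_ds.py code, assuming a plain Llama-style sharded Hugging Face checkpoint (the path is a placeholder, and LISA's own model class handles the LLaVA delta weights differently); `low_cpu_mem_usage=True` requires the `accelerate` package:

```python
# Minimal sketch (not the actual LISA loading code): keep each rank's CPU copy
# of the weights small while the checkpoint shards are read.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

MODEL_PATH = "/path/to/llama-7b"  # placeholder path; LISA builds on LLaVA delta weights

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,  # hold weights in fp16 instead of fp32 on the CPU
    low_cpu_mem_usage=True,     # stream shards rather than building a full in-RAM copy first
)
```

The other direction, as suggested above, is simply to launch fewer ranks: the DeepSpeed launcher accepts e.g. `deepspeed --include localhost:0,1,2,3 train_ds.py ...`, which halves the number of processes and therefore roughly halves the peak CPU memory during checkpoint loading.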
I can load the checkpoint correctly if I run train_ds.py directly, but when I launch it with deepspeed as in the given example, this error occurs. Can you tell me how to fix it?
```
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards:  50%|███████████████████████████████████████████████████████████▌ | 1/2 [00:10<00:10, 10.36s/it]
[2023-08-22 11:46:22,908] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3210
[2023-08-22 11:46:23,766] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3211
[2023-08-22 11:46:24,696] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3212
Loading checkpoint shards:  50%|███████████████████████████████████████████████████████████▌ | 1/2 [00:18<00:18, 18.49s/it]
[2023-08-22 11:46:25,962] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3213
[2023-08-22 11:46:26,758] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3214
[2023-08-22 11:46:26,758] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3215
[2023-08-22 11:46:27,597] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3216
[2023-08-22 11:46:28,151] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3218
[2023-08-22 11:46:28,827] [ERROR] [launch.py:321:sigkill_handler] ['/home/TianYunjie/anaconda3/envs/lisa/bin/python', '-u', 'train_ds.py', '--local_rank=7', '--version=liuhaotian/LLaVA-Lightning-7B-delta-v1-1', '--dataset_dir=/home/ubuntu/Workspace/TianYunjie/datasets/LISA_datasets/', '--vision_pretrained=sam_vit_h_4b8939.pth', '--dataset=sem_seg||refer_seg||vqa||reason_seg', '--sample_rates=9,3,3,1', '--exp_name=lisa-7b'] exits with return code = -9
```
And here is the related launch log:
```
[2023-08-22 11:45:50,020] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-22 11:45:52,164] [WARNING] [runner.py:201:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-22 11:45:52,224] [INFO] [runner.py:567:main] cmd = /home/TianYunjie/anaconda3/envs/lisa/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=24999 --enable_each_rank_log=None train_ds.py --version=liuhaotian/LLaVA-Lightning-7B-delta-v1-1 --dataset_dir=/home/ubuntu/Workspace/TianYunjie/datasets/LISA_datasets/ --vision_pretrained=sam_vit_h_4b8939.pth --dataset=sem_seg||refer_seg||vqa||reason_seg --sample_rates=9,3,3,1 --exp_name=lisa-7b
[2023-08-22 11:45:54,525] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-22 11:45:56,379] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
```
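For reference, `exits with return code = -9` means the worker was killed with SIGKILL, which during checkpoint loading is usually the Linux kernel OOM killer (you can confirm this with `dmesg` or `journalctl -k`). If you want to watch the memory climb, here is a small, hypothetical diagnostic sketch (not part of the LISA repo) that logs each rank's resident CPU memory while the shards load:

```python
# Hypothetical diagnostic sketch (not part of train_ds.py): periodically print
# this rank's resident CPU memory so the ~8x aggregate usage can be observed.
# Requires the `psutil` package.
import os
import threading
import time

import psutil


def log_rss(interval_s: float = 5.0) -> None:
    proc = psutil.Process(os.getpid())
    rank = os.environ.get("LOCAL_RANK", "?")  # set by the DeepSpeed launcher
    while True:
        rss_gib = proc.memory_info().rss / 1024**3
        print(f"[rank {rank}] resident CPU memory: {rss_gib:.1f} GiB", flush=True)
        time.sleep(interval_s)


# Start as a daemon thread near the top of the training script, before model loading.
threading.Thread(target=log_rss, daemon=True).start()
```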