It seems that the CPU RAM has run out. You may consider using a machine with more CPU memory or lowering the number of GPUs.
The machine has 125G of memory in total, with around 110G available according to `free -h`. Isn't that enough? May I know the memory requirement for LISA?
Hi, a single process actually does not consume that much. But in your case you train with 8 GPUs, so the CPU memory consumption is roughly 8 times as much, since each rank loads its own copy of the checkpoint into CPU RAM.
FYI, on my side 256G of CPU memory is enough for training on 8x RTX 3090 24G GPUs.
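If the total has to fit into roughly 110G, the per-process footprint during checkpoint loading can also be reduced. Below is a minimal sketch, not the actual train_ds.py code, assuming a plain Llama-style sharded Hugging Face checkpoint (the path is a placeholder, and LISA's own model class handles the LLaVA delta weights differently); `low_cpu_mem_usage=True` requires the `accelerate` package:

```python
# Minimal sketch (not the actual LISA loading code): keep each rank's CPU copy
# of the weights small while the checkpoint shards are read.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

MODEL_PATH = "/path/to/llama-7b"  # placeholder path; LISA builds on LLaVA delta weights

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,  # hold weights in fp16 instead of fp32 on the CPU
    low_cpu_mem_usage=True,     # stream shards rather than building a full in-RAM copy first
)
```

The other direction, as suggested above, is simply to launch fewer ranks: the DeepSpeed launcher accepts e.g. `deepspeed --include localhost:0,1,2,3 train_ds.py ...`, which halves the number of processes and therefore roughly halves the peak CPU memory during checkpoint loading.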
I can load the checkpoint correctly if I run train_ds.py directly, but when I launch it with deepspeed as in the given example, this error occurs. Can you tell me how to fix it?
```
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards:  50%|███████████████████████████████████████████████████████████▌ | 1/2 [00:10<00:10, 10.36s/it]
[2023-08-22 11:46:22,908] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3210
[2023-08-22 11:46:23,766] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3211
[2023-08-22 11:46:24,696] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3212
Loading checkpoint shards:  50%|███████████████████████████████████████████████████████████▌ | 1/2 [00:18<00:18, 18.49s/it]
[2023-08-22 11:46:25,962] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3213
[2023-08-22 11:46:26,758] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3214
[2023-08-22 11:46:26,758] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3215
[2023-08-22 11:46:27,597] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3216
[2023-08-22 11:46:28,151] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3218
[2023-08-22 11:46:28,827] [ERROR] [launch.py:321:sigkill_handler] ['/home/TianYunjie/anaconda3/envs/lisa/bin/python', '-u', 'train_ds.py', '--local_rank=7', '--version=liuhaotian/LLaVA-Lightning-7B-delta-v1-1', '--dataset_dir=/home/ubuntu/Workspace/TianYunjie/datasets/LISA_datasets/', '--vision_pretrained=sam_vit_h_4b8939.pth', '--dataset=sem_seg||refer_seg||vqa||reason_seg', '--sample_rates=9,3,3,1', '--exp_name=lisa-7b'] exits with return code = -9
```
And here is the related launch log:
```
[2023-08-22 11:45:50,020] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-22 11:45:52,164] [WARNING] [runner.py:201:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-22 11:45:52,224] [INFO] [runner.py:567:main] cmd = /home/TianYunjie/anaconda3/envs/lisa/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=24999 --enable_each_rank_log=None train_ds.py --version=liuhaotian/LLaVA-Lightning-7B-delta-v1-1 --dataset_dir=/home/ubuntu/Workspace/TianYunjie/datasets/LISA_datasets/ --vision_pretrained=sam_vit_h_4b8939.pth --dataset=sem_seg||refer_seg||vqa||reason_seg --sample_rates=9,3,3,1 --exp_name=lisa-7b
[2023-08-22 11:45:54,525] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-22 11:45:56,379] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
```
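For reference, `exits with return code = -9` means the worker was killed with SIGKILL, which during checkpoint loading is usually the Linux kernel OOM killer (you can confirm this with `dmesg` or `journalctl -k`). If you want to watch the memory climb, here is a small, hypothetical diagnostic sketch (not part of the LISA repo) that logs each rank's resident CPU memory while the shards load:

```python
# Hypothetical diagnostic sketch (not part of train_ds.py): periodically print
# this rank's resident CPU memory so the ~8x aggregate usage can be observed.
# Requires the `psutil` package.
import os
import threading
import time

import psutil


def log_rss(interval_s: float = 5.0) -> None:
    proc = psutil.Process(os.getpid())
    rank = os.environ.get("LOCAL_RANK", "?")  # set by the DeepSpeed launcher
    while True:
        rss_gib = proc.memory_info().rss / 1024**3
        print(f"[rank {rank}] resident CPU memory: {rss_gib:.1f} GiB", flush=True)
        time.sleep(interval_s)


# Start as a daemon thread near the top of the training script, before model loading.
threading.Thread(target=log_rss, daemon=True).start()
```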