Open Eren-yeager-zero opened 11 months ago
This is most likely caused by insufficient GPU memory, which got the process killed.
I have two 32 GB V100s, which should be enough. Is there something else I need to configure?
Every time I have hit this error, it was caused by insufficient GPU memory. Please check again, and also see whether anyone else has run into a similar problem.
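Hello, when I run web_demo, vicuna loads successfully, but during training the process keeps getting killed. Could you tell me what might be going wrong? I'm new to this and would really appreciate any pointers. Thank you very much!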
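This may not be a GPU memory problem; it could also be insufficient host RAM. In my experience, even 64 GB of RAM is not always enough to load the model, so you need to reduce RAM usage while the model is being loaded.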
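To make the RAM argument concrete, here is a rough back-of-the-envelope sketch. It assumes each training process loads its own full copy of the weights into host RAM, and that `from_pretrained` materializes them in float32 when no `torch_dtype` is passed. The parameter count of about 6.74 billion for a 7B LLaMA-style model is an approximation, and `estimate_load_ram_gib` is a hypothetical helper, not part of the repository.

```python
def estimate_load_ram_gib(num_params, bytes_per_param, num_processes):
    """Estimate host RAM (GiB) needed just to hold the weights.

    Counts only the model weights; the visual encoder, dataset,
    and framework overhead come on top of this.
    """
    total_bytes = num_params * bytes_per_param * num_processes
    return total_bytes / 2**30


# Assumption: roughly 6.74e9 parameters for a 7B LLaMA-style model.
APPROX_7B_PARAMS = 6.74e9

# Two processes, as in a two-GPU launch.
fp32_gib = estimate_load_ram_gib(APPROX_7B_PARAMS, 4, 2)  # float32: 4 bytes per weight
fp16_gib = estimate_load_ram_gib(APPROX_7B_PARAMS, 2, 2)  # float16: 2 bytes per weight

print(f"float32, 2 processes: {fp32_gib:.1f} GiB")
print(f"float16, 2 processes: {fp16_gib:.1f} GiB")
```

Under these assumptions the float32 case alone comes to roughly 50 GiB for two processes, before anything else is counted, which would explain why 64 GB of RAM can run out during loading even though the GPUs have room.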
[!] load base configuration: config/base.yaml
[!] load configuration from config/openllama_peft.yaml
[2023-12-01 15:10:58,146] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[!] load base configuration: config/base.yaml
[!] load configuration from config/openllama_peft.yaml
[!] collect 161151 samples for training
Initializing visual encoder from ../pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth ...
[!] collect 161151 samples for training
Initializing visual encoder from ../pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth ...
Visual encoder initialized.
Initializing language decoder from ../pretrained_ckpt/vicuna_ckpt/7b_v0/ ...
Visual encoder initialized.
Initializing language decoder from ../pretrained_ckpt/vicuna_ckpt/7b_v0/ ...
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
[2023-12-01 15:14:47,133] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 383754
[2023-12-01 15:14:47,133] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 383755
[2023-12-01 15:14:49,694] [ERROR] [launch.py:434:sigkill_handler] ['/root/miniconda3/envs/AnomalyGPT_env/bin/python', '-u', 'train_mvtec.py', '--local_rank=1', '--model', 'openllama_peft', '--stage', '1', '--imagebind_ckpt_path', '../pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth', '--vicuna_ckpt_path', '../pretrained_ckpt/vicuna_ckpt/7b_v0/', '--delta_ckpt_path', '../pretrained_ckpt/pandagpt_ckpt/7b/pytorch_model.pt', '--max_tgt_len', '1024', '--data_path', '../data/pandagpt4_visual_instruction_data.json', '--image_root_path', '../data/images/', '--save_path', './ckpt/train_mvtec/', '--log_path', './ckpt/train_mvtec/log_rest/'] exits with return code = -9
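A note on reading `return code = -9` in the log above: assuming the launcher reports child exit codes the way Python's `subprocess` module does, a negative value means the child was terminated by that signal number. The small sketch below decodes it on a POSIX system; `describe_return_code` is a hypothetical helper written for illustration.

```python
import signal


def describe_return_code(return_code):
    """Explain a child-process return code using subprocess conventions."""
    if return_code < 0:
        # Negative codes mean "terminated by signal number -return_code".
        sig = signal.Signals(-return_code)
        return f"killed by signal {sig.name} ({sig.value})"
    return f"exited with status {return_code}"


print(describe_return_code(-9))
```

Signal 9 is SIGKILL. The process itself cannot raise a Python exception for it, which is why there is no traceback. On Linux, a SIGKILL during model loading is often the kernel's out-of-memory killer reacting to exhausted host RAM rather than GPU memory. The kernel log (for example via `dmesg`) usually shows whether that happened.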
The error occurs while the model is loading the pretrained vicuna weights [self.llama_model = LlamaForCausalLM.from_pretrained(vicuna_ckpt_path)]. How can I fix this? Thank you very much!
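One possible way to lower host RAM use at that `from_pretrained` call is sketched below. It is not the repository's own fix. `low_cpu_mem_usage` and `torch_dtype` are existing `from_pretrained` arguments in Hugging Face Transformers (the former needs the `accelerate` package installed). The helper names `build_low_memory_kwargs` and `load_language_decoder` are hypothetical. Whether loading in float16 suits this training setup is an assumption you would need to verify.

```python
def build_low_memory_kwargs(half_precision=True):
    """Build from_pretrained keyword arguments that reduce peak host RAM."""
    # Avoids allocating a randomly initialized full copy of the model
    # before the checkpoint weights are loaded.
    kwargs = {"low_cpu_mem_usage": True}
    if half_precision:
        import torch  # imported lazily so the helper works without torch

        # Assumption: float16 weights are acceptable for this training run.
        kwargs["torch_dtype"] = torch.float16
    return kwargs


def load_language_decoder(vicuna_ckpt_path, half_precision=True):
    """Load the LLaMA/vicuna decoder with reduced host RAM usage."""
    from transformers import LlamaForCausalLM

    return LlamaForCausalLM.from_pretrained(
        vicuna_ckpt_path, **build_low_memory_kwargs(half_precision)
    )
```

In the model code this would stand in for the plain call, for example `self.llama_model = load_language_decoder(vicuna_ckpt_path)`. If RAM is still exhausted, launching on a single GPU first, or adding swap space, are other things to try.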