Vicuna fails to load during training, but loads successfully when running web_demo

Eren-yeager-zero commented 11 months ago

[!] load base configuration: config/base.yaml
[!] load configuration from config/openllama_peft.yaml
[2023-12-01 15:10:58,146] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[!] load base configuration: config/base.yaml
[!] load configuration from config/openllama_peft.yaml
[!] collect 161151 samples for training
Initializing visual encoder from ../pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth ...
[!] collect 161151 samples for training
Initializing visual encoder from ../pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth ...
Visual encoder initialized.
Initializing language decoder from ../pretrained_ckpt/vicuna_ckpt/7b_v0/ ...
Visual encoder initialized.
Initializing language decoder from ../pretrained_ckpt/vicuna_ckpt/7b_v0/ ...
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
[2023-12-01 15:14:47,133] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 383754
[2023-12-01 15:14:47,133] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 383755
[2023-12-01 15:14:49,694] [ERROR] [launch.py:434:sigkill_handler] ['/root/miniconda3/envs/AnomalyGPT_env/bin/python', '-u', 'train_mvtec.py', '--local_rank=1', '--model', 'openllama_peft', '--stage', '1', '--imagebind_ckpt_path', '../pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth', '--vicuna_ckpt_path', '../pretrained_ckpt/vicuna_ckpt/7b_v0/', '--delta_ckpt_path', '../pretrained_ckpt/pandagpt_ckpt/7b/pytorch_model.pt', '--max_tgt_len', '1024', '--data_path', '../data/pandagpt4_visual_instruction_data.json', '--image_root_path', '../data/images/', '--save_path', './ckpt/train_mvtec/', '--log_path', './ckpt/train_mvtec/log_rest/'] exits with return code = -9

The error occurs while the model is loading the pretrained Vicuna weights, at the line (self.llama_model = LlamaForCausalLM.from_pretrained(vicuna_ckpt_path)). How can I resolve this? Many thanks!
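The last log line reports "return code = -9". As a minimal sketch, assuming the launcher reports the child's return code using Python's subprocess convention (a negative value means the child was terminated by that signal number), the value can be decoded as follows. The helper name describe_return_code is hypothetical and not part of DeepSpeed or AnomalyGPT.

```python
import signal


def describe_return_code(return_code: int) -> str:
    """Explain a return code using Python's subprocess convention."""
    if return_code < 0:
        # Negative values mean the child was terminated by signal number -return_code.
        sig = signal.Signals(-return_code)
        return f"terminated by signal {sig.name} ({sig.value})"
    if return_code == 0:
        return "exited normally"
    return f"exited with error status {return_code}"


print(describe_return_code(-9))
```

On Linux, signal 9 is SIGKILL. A process that runs out of GPU memory typically raises a CUDA out-of-memory error inside Python, whereas a SIGKILL during loading is often sent by the kernel's out-of-memory killer when host RAM is exhausted, so the kernel log (for example via dmesg) may be worth checking.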

FantasticGNU commented 11 months ago

This is most likely caused by running out of GPU memory, which got the process killed.

Eren-yeager-zero commented 11 months ago

> This is most likely caused by running out of GPU memory, which got the process killed.

I have two 32 GB V100s, which should be enough. Is there anything else I need to configure specifically?

FantasticGNU commented 11 months ago

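Whenever I ran into this error before, it was caused by insufficient GPU memory. Please check again, and also see whether anyone else has run into a similar problem.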

Eren-yeager-zero commented 11 months ago

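> Whenever I ran into this error before, it was caused by insufficient GPU memory. Please check again, and also see whether anyone else has run into a similar problem.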

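Hello, Vicuna loads successfully when I run web_demo, but during training the process keeps getting killed. What could be going wrong? I am new to this, so any guidance would be greatly appreciated. Thank you!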

kirasun23 commented 11 months ago

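This may not be a GPU memory problem; it could also be that host RAM is insufficient. In my experience, 64 GB of RAM was not always enough to load the model, so you need to reduce RAM usage while the model is being loaded.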
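One possible way to reduce RAM usage at load time is sketched below. It rests on several assumptions: that the model has roughly 7e9 parameters, that each of the two training processes seen in the log loads its own full copy of the weights, that half-precision weights are acceptable for this training setup, and that the accelerate package is installed (low_cpu_mem_usage needs it). The helper names estimate_load_ram_gib and load_vicuna_low_mem are hypothetical, and whether these options fit the repository's DeepSpeed configuration has not been tested.

```python
def estimate_load_ram_gib(n_params: float, bytes_per_param: int, n_processes: int) -> float:
    """Rough lower bound on host RAM when every process holds a full copy of the weights."""
    return n_params * bytes_per_param * n_processes / 2**30


def load_vicuna_low_mem(vicuna_ckpt_path: str):
    """Load the checkpoint in half precision without an extra full copy in RAM.

    Imports are local so the estimate above runs without torch or transformers.
    """
    import torch
    from transformers import LlamaForCausalLM

    return LlamaForCausalLM.from_pretrained(
        vicuna_ckpt_path,
        torch_dtype=torch.float16,  # assumption: fp16 weights are acceptable here
        low_cpu_mem_usage=True,     # requires the accelerate package
    )


# Assumed figures: about 7e9 parameters, two training processes as in the log.
fp32_two_ranks = estimate_load_ram_gib(7e9, 4, 2)
fp16_one_rank = estimate_load_ram_gib(7e9, 2, 1)
print(f"float32, 2 processes: {fp32_two_ranks:.1f} GiB")
print(f"float16, 1 process:  {fp16_one_rank:.1f} GiB")
```

Under these assumptions, two processes loading float32 weights already need more than 50 GiB for the weights alone, before any temporary copies, which would be consistent with a 64 GB machine running out of RAM during training even if a single-process load succeeds.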

CASIA-IVA-Lab / AnomalyGPT

Vicuna fails to load during training, but loads successfully when running web_demo #50