InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLMs (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Errors of llava pretrain for phi3_mini_4k_instruct_clip_vit_large_p14_336 #713

Open JiamingLv opened 5 months ago

JiamingLv commented 5 months ago

I strictly followed the documentation for phi3_mini_4k_instruct_clip_vit_large_p14_336 and ran the command: NPROC_PER_NODE=4 xtuner train llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024

Conda environment: python==3.10, transformers==4.41.1, torch==2.3.0, CUDA 12.1, 4x RTX 3090.

05/23 08:10:20 - mmengine - INFO - before_train in EvaluateChatHook. You are using an old version of the checkpointing format that is deprecated (we will also silently ignore gradient_checkpointing_kwargs in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.

```
[rank3]: Traceback (most recent call last):
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
[rank3]:     getattr(hook, fn_name)(self, **kwargs)
[rank3]:   File "/home/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 230, in before_train
[rank3]:     self._generate_samples(runner, max_new_tokens=50)
[rank3]:   File "/home/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 216, in _generate_samples
[rank3]:     self._eval_images(runner, model, device, max_new_tokens,
[rank3]:   File "/home/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 148, in _eval_images
[rank3]:     generation_output = model.generate(
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
[rank3]:     result = self._sample(
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/transformers/generation/utils.py", line 2390, in _sample
[rank3]:     model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs)
[rank3]:   File "/media/ljm/anaconda3/envs/xtuner/lib/python3.10/site-packages/transformers/generation/utils.py", line 1321, in _get_initial_cache_position
[rank3]:     past_length = model_kwargs["past_key_values"][0][0].shape[2]
[rank3]: TypeError: 'NoneType' object is not subscriptable
```

acdart commented 5 months ago

Downgrading transformers can solve this.

J0eky commented 5 months ago

> Downgrading transformers can solve this.

@acdart Hi, my transformers version is currently 4.41.1. Which version should I downgrade to?
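The thread does not confirm which release works. As a rough sketch, assuming the failing _get_initial_cache_position code path was introduced in 4.41, pinning transformers to an earlier release (4.40.2 below is only an illustrative choice, not one confirmed in this issue) would look like:

```bash
# Assumption: a pre-4.41 transformers release avoids the failing
# _get_initial_cache_position code path; 4.40.2 is only an example version.
pip install "transformers==4.40.2"

# Confirm which version is now installed.
python -c "import transformers; print(transformers.__version__)"
```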

nakoeni commented 5 months ago

How were you even able to run that command? It tells me that it doesn't recognize NPROC_PER_NODE=4, and if I run it without that bit (i.e. just running xtuner train llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024), it says it doesn't know what llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain is.
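A hedged note on both symptoms, based on common causes rather than anything confirmed in this thread: the NPROC_PER_NODE=4 prefix is POSIX-shell syntax for setting an environment variable for a single command, so it is rejected by shells such as Windows cmd; and xtuner resolves names like llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain against its built-in config list, which can be inspected or copied locally:

```bash
# Set the env var explicitly if your shell rejects the one-line VAR=value prefix form.
export NPROC_PER_NODE=4
xtuner train llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain \
    --deepspeed deepspeed_zero2 --seed 1024

# List xtuner's built-in configs to check the exact name,
# or copy one into the current directory and train from the file path instead.
xtuner list-cfg | grep llava_phi3
xtuner copy-cfg llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain .
```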