InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Cannot start training; the problem seems to be with mmengine #691

Open Dominic23331 opened 4 months ago

Dominic23331 commented 4 months ago

During training, the program stopped after printing the following output. How can I resolve this?

2024-05-15 09:29:44.939294: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-15 09:29:44.939347: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-15 09:29:44.940554: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-05-15 09:29:49,373] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-05-15 09:30:12.273661: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-15 09:30:12.273709: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-15 09:30:12.274819: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-05-15 09:30:16,168] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
05/15 09:30:19 - mmengine - INFO -

System environment:
    sys.platform: linux
    Python: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1102040617
    GPU 0: B1.gpu.medium
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.2, V12.2.140
    GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    PyTorch: 2.1.0a0+32f93b1
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 1102040617
    deterministic: False
    Distributed launcher: none
    Distributed training: False
    GPU number: 1

05/15 09:30:19 - mmengine - INFO - Config: SYSTEM = 'xtuner.utils.SYSTEM_TEMPLATE.alpaca' accumulative_counts = 16 alpaca_en = dict( dataset=dict(path='./alpaca', type='datasets.load_dataset'), dataset_map_fn='xtuner.dataset.map_fns.alpaca_map_fn', max_length=2048, pack_to_max_length=True, remove_unused_columns=True, shuffle_before_pack=True, template_map_fn=dict( template='xtuner.utils.PROMPT_TEMPLATE.chatglm3', type='xtuner.dataset.map_fns.template_map_fn_factory'), tokenizer=dict( encode_special_tokens=True, padding_side='left', pretrained_model_name_or_path='/gemini/pretrain', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.dataset.process_hf_dataset', use_varlen_attn=False) alpaca_en_path = './alpaca' batch_size = 1 betas = ( 0.9, 0.999, ) custom_hooks = [ dict( tokenizer=dict( encode_special_tokens=True, padding_side='left', pretrained_model_name_or_path='/gemini/pretrain', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.engine.hooks.DatasetInfoHook'), dict( evaluation_inputs=[ '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai', ], every_n_iters=500, prompt_template='xtuner.utils.PROMPT_TEMPLATE.chatglm3', system='xtuner.utils.SYSTEM_TEMPLATE.alpaca', tokenizer=dict( encode_special_tokens=True, padding_side='left', pretrained_model_name_or_path='/gemini/pretrain', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.engine.hooks.EvaluateChatHook'), ] dataloader_num_workers = 0 default_hooks = dict( checkpoint=dict( by_epoch=False, interval=500, max_keep_ckpts=2, type='mmengine.hooks.CheckpointHook'), logger=dict( interval=10, log_metric_by_epoch=False, type='mmengine.hooks.LoggerHook'), param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'), sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'), timer=dict(type='mmengine.hooks.IterTimerHook')) env_cfg = dict( cudnn_benchmark=False, dist_cfg=dict(backend='nccl'), mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0)) evaluation_freq = 500 evaluation_inputs = [ '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai', ] launcher = 'none' load_from = None log_level = 'INFO' log_processor = dict(by_epoch=False) lr = 0.0002 max_epochs = 3 max_length = 2048 max_norm = 1 model = dict( llm=dict( pretrained_model_name_or_path='/gemini/pretrain', quantization_config=dict( bnb_4bit_compute_dtype='torch.float16', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, llm_int8_has_fp16_weight=False, llm_int8_threshold=6.0, load_in_4bit=True, load_in_8bit=False, type='transformers.BitsAndBytesConfig'), torch_dtype='torch.float16', trust_remote_code=True, type='transformers.AutoModelForCausalLM.from_pretrained'), lora=dict( bias='none', lora_alpha=16, lora_dropout=0.1, r=64, task_type='CAUSAL_LM', type='peft.LoraConfig'), type='xtuner.model.SupervisedFinetune', use_varlen_attn=False) optim_type = 'torch.optim.AdamW' optim_wrapper = dict( accumulative_counts=16, clip_grad=dict(error_if_nonfinite=False, max_norm=1), dtype='float16', loss_scale='dynamic', optimizer=dict( betas=( 0.9, 0.999, ), lr=0.0002, type='torch.optim.AdamW', weight_decay=0), type='mmengine.optim.AmpOptimWrapper') pack_to_max_length = True param_scheduler = [ dict( begin=0, by_epoch=True, convert_to_iter_based=True, end=0.09, start_factor=1e-05, type='mmengine.optim.LinearLR'), dict( begin=0.09, by_epoch=True, convert_to_iter_based=True, end=3, eta_min=0.0, type='mmengine.optim.CosineAnnealingLR'), ] 
pretrained_model_name_or_path = '/gemini/pretrain' prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.chatglm3' randomness = dict(deterministic=False, seed=None) resume = False save_steps = 500 save_total_limit = 2 tokenizer = dict( encode_special_tokens=True, padding_side='left', pretrained_model_name_or_path='/gemini/pretrain', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained') train_cfg = dict(max_epochs=3, type='xtuner.engine.runner.TrainLoop') train_dataloader = dict( batch_size=1, collate_fn=dict( type='xtuner.dataset.collate_fns.default_collate_fn', use_varlen_attn=False), dataset=dict( dataset=dict(path='./alpaca', type='datasets.load_dataset'), dataset_map_fn='xtuner.dataset.map_fns.alpaca_map_fn', max_length=2048, pack_to_max_length=True, remove_unused_columns=True, shuffle_before_pack=True, template_map_fn=dict( template='xtuner.utils.PROMPT_TEMPLATE.chatglm3', type='xtuner.dataset.map_fns.template_map_fn_factory'), tokenizer=dict( encode_special_tokens=True, padding_side='left', pretrained_model_name_or_path='/gemini/pretrain', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.dataset.process_hf_dataset', use_varlen_attn=False), num_workers=0, sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler')) use_varlen_attn = False visualizer = None warmup_ratio = 0.03 weight_decay = 0 work_dir = './work_dirs/chatglm3_6b_base_qlora_alpaca_e3_copy'

quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
05/15 09:30:19 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
low_cpu_mem_usage was None, now set to True since model is quantized.

hhaAndroid commented 3 months ago

Has this been resolved? You pasted quite a lot of output; could you tell us which config file you used, what the launch command was, and whether you modified anything?

liuwake commented 1 month ago

I ran into the same problem and solved it by changing the training launch command.

Runtime environment:
    launcher: none
    randomness: {'seed': None, 'deterministic': False}
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    deterministic: False
    Distributed launcher: none
    Distributed training: False
    GPU number: 1

07/29 22:44:30 - mmengine - INFO - Config: SYSTEM = '' accumulative_counts = 1 batch_size = 1 betas = ( 0.9, 0.999, ) custom_hooks = [ dict( tokenizer=dict( padding_side='right', pretrained_model_name_or_path='/home/aistudio/data/internlm2-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.engine.hooks.DatasetInfoHook'), dict( evaluation_images='https://llava-vl.github.io/static/images/view.jpg', evaluation_inputs=[ 'Please describe this picture', 'What is the equipment in the image?', ], every_n_iters=500, image_processor=dict( pretrained_model_name_or_path= '/home/aistudio/.cache/modelscope/hub/AI-ModelScope/clip-vit-large-patch14-336', trust_remote_code=True, type='transformers.CLIPImageProcessor.from_pretrained'), prompt_template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat', system='', tokenizer=dict( padding_side='right', pretrained_model_name_or_path='/home/aistudio/data/internlm2-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.engine.hooks.EvaluateChatHook'), ] data_path = '/home/aistudio/llava/llava_data/repeated_data.json' data_root = '/home/aistudio/llava/llava_data/' dataloader_num_workers = 4 default_hooks = dict( checkpoint=dict( by_epoch=False, interval=500, max_keep_ckpts=2, type='mmengine.hooks.CheckpointHook'), logger=dict( interval=10, log_metric_by_epoch=False, type='mmengine.hooks.LoggerHook'), param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'), sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'), timer=dict(type='mmengine.hooks.IterTimerHook')) env_cfg = dict( cudnn_benchmark=False, dist_cfg=dict(backend='nccl'), mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0)) evaluation_freq = 500 evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg' evaluation_inputs = [ 'Please describe this picture', 'What is the equipment in the image?', ] image_folder = '/home/aistudio/llava/llava_data/' image_processor = dict( pretrained_model_name_or_path= '/home/aistudio/.cache/modelscope/hub/AI-ModelScope/clip-vit-large-patch14-336', trust_remote_code=True, type='transformers.CLIPImageProcessor.from_pretrained') launcher = 'none' llava_dataset = dict( data_path='/home/aistudio/llava/llava_data/repeated_data.json', dataset_map_fn='xtuner.dataset.map_fns.llava_map_fn', image_folder='/home/aistudio/llava/llava_data/', image_processor=dict( pretrained_model_name_or_path= '/home/aistudio/.cache/modelscope/hub/AI-ModelScope/clip-vit-large-patch14-336', trust_remote_code=True, type='transformers.CLIPImageProcessor.from_pretrained'), max_length=1472, pad_image_to_square=True, template_map_fn=dict( template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat', type='xtuner.dataset.map_fns.template_map_fn_factory'), tokenizer=dict( padding_side='right', pretrained_model_name_or_path='/home/aistudio/data/internlm2-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.dataset.LLaVADataset') llm_name_or_path = '/home/aistudio/data/internlm2-1_8b' load_from = None log_level = 'INFO' log_processor = dict(by_epoch=False) lr = 0.0002 max_epochs = 1 max_length = 1472 max_norm = 1 model = dict( freeze_llm=True, freeze_visual_encoder=True, llm=dict( pretrained_model_name_or_path='/home/aistudio/data/internlm2-1_8b', quantization_config=dict( bnb_4bit_compute_dtype='torch.float16', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, llm_int8_has_fp16_weight=False, llm_int8_threshold=6.0, load_in_4bit=True, load_in_8bit=False, 
type='transformers.BitsAndBytesConfig'), torch_dtype='torch.float16', trust_remote_code=True, type='transformers.AutoModelForCausalLM.from_pretrained'), llm_lora=dict( bias='none', lora_alpha=256, lora_dropout=0.05, r=512, task_type='CAUSAL_LM', type='peft.LoraConfig'), pretrained_pth='/home/aistudio/llava/iter_2181.pth', type='xtuner.model.LLaVAModel', visual_encoder=dict( pretrained_model_name_or_path= '/home/aistudio/.cache/modelscope/hub/AI-ModelScope/clip-vit-large-patch14-336', type='transformers.CLIPVisionModel.from_pretrained'), visual_encoder_lora=dict( bias='none', lora_alpha=16, lora_dropout=0.05, r=64, type='peft.LoraConfig')) optim_type = 'torch.optim.AdamW' optim_wrapper = dict( optimizer=dict( betas=( 0.9, 0.999, ), lr=0.0002, type='torch.optim.AdamW', weight_decay=0), type='DeepSpeedOptimWrapper') param_scheduler = [ dict( begin=0, by_epoch=True, convert_to_iter_based=True, end=0.03, start_factor=1e-05, type='mmengine.optim.LinearLR'), dict( begin=0.03, by_epoch=True, convert_to_iter_based=True, end=1, eta_min=0.0, type='mmengine.optim.CosineAnnealingLR'), ] pretrained_pth = '/home/aistudio/llava/iter_2181.pth' prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.internlm2_chat' randomness = dict(deterministic=False, seed=None) resume = False runner_type = 'FlexibleRunner' save_steps = 500 save_total_limit = 2 strategy = dict( config=dict( bf16=dict(enabled=False), fp16=dict(enabled=True, initial_scale_power=16), gradient_accumulation_steps='auto', gradient_clipping='auto', train_micro_batch_size_per_gpu='auto', zero_allow_untested_optimizer=True, zero_force_ds_cpu_optimizer=False, zero_optimization=dict(overlap_comm=True, stage=2)), exclude_frozen_parameters=True, gradient_accumulation_steps=1, gradient_clipping=1, sequence_parallel_size=1, train_micro_batch_size_per_gpu=1, type='xtuner.engine.DeepSpeedStrategy') tokenizer = dict( padding_side='right', pretrained_model_name_or_path='/home/aistudio/data/internlm2-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained') train_cfg = dict(max_epochs=1, type='xtuner.engine.runner.TrainLoop') train_dataloader = dict( batch_size=1, collate_fn=dict(type='xtuner.dataset.collate_fns.default_collate_fn'), dataset=dict( data_path='/home/aistudio/llava/llava_data/repeated_data.json', dataset_map_fn='xtuner.dataset.map_fns.llava_map_fn', image_folder='/home/aistudio/llava/llava_data/', image_processor=dict( pretrained_model_name_or_path= '/home/aistudio/.cache/modelscope/hub/AI-ModelScope/clip-vit-large-patch14-336', trust_remote_code=True, type='transformers.CLIPImageProcessor.from_pretrained'), max_length=1472, pad_image_to_square=True, template_map_fn=dict( template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat', type='xtuner.dataset.map_fns.template_map_fn_factory'), tokenizer=dict( padding_side='right', pretrained_model_name_or_path='/home/aistudio/data/internlm2-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.dataset.LLaVADataset'), num_workers=4, pin_memory=True, sampler=dict( length_property='modality_length', per_device_batch_size=1, type='xtuner.dataset.samplers.LengthGroupedSampler')) visual_encoder_name_or_path = '/home/aistudio/.cache/modelscope/hub/AI-ModelScope/clip-vit-large-patch14-336' visualizer = None warmup_ratio = 0.03 weight_decay = 0 work_dir = './work_dirs/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy'

07/29 22:44:31 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
^C
Traceback (most recent call last):
  File "/home/aistudio/.local/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in <module>
    main()
  File "/home/aistudio/.local/lib/python3.10/site-packages/xtuner/tools/train.py", line 353, in main
    runner = RUNNERS.build(cfg)
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 196, in build_runner_from_cfg
    runner = runner_cls.from_cfg(args)  # type: ignore
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 423, in from_cfg
    runner = cls(
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 403, in __init__
    self.register_hooks(default_hooks, custom_hooks)
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1430, in register_hooks
    self.register_custom_hooks(custom_hooks)
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1410, in register_custom_hooks
    self.register_hook(hook)
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1310, in register_hook
    hook_obj = HOOKS.build(hook)
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/home/aistudio/.local/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/home/aistudio/.local/lib/python3.10/site-packages/xtuner/engine/hooks/evaluate_chat_hook.py", line 48, in __init__
    self.evaluation_images = [
  File "/home/aistudio/.local/lib/python3.10/site-packages/xtuner/engine/hooks/evaluate_chat_hook.py", line 49, in <listcomp>
    load_image(img) for img in self.evaluation_images
  File "/home/aistudio/.local/lib/python3.10/site-packages/xtuner/dataset/utils.py", line 261, in load_image
    response = requests.get(image_file)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/requests/sessions.py", line 746, in send
    r.content
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/requests/models.py", line 902, in content
    self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/requests/models.py", line 820, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/urllib3/response.py", line 1060, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/urllib3/response.py", line 949, in read
    data = self._raw_read(amt)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/urllib3/response.py", line 873, in _raw_read
    data = self._fp_read(amt, read1=read1) if not fp_closed else b""
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/urllib3/response.py", line 856, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/http/client.py", line 465, in read
    s = self.fp.read(amt)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/ssl.py", line 1274, in recv_into
    return self.read(nbytes, buffer)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/ssl.py", line 1130, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt


- I tried the following in Jupyter:
    - reinstalling xtuner;
    - installing openmim;
    - running xtuner again via its absolute path;
    - running xtuner via a `%%` magic command;
    - adding the path with `import os; os.environ['PATH'] += ':/home/aistudio/.local/bin'`;
    - none of which had any effect.
- In the end I removed the `--deepspeed` option from the command, i.e. I ran `!/home/aistudio/.local/bin/xtuner train /home/aistudio/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py`, and training ran normally (a minimal sketch of this launch follows below).
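
A minimal sketch of that fallback launch as a plain Python cell, equivalent to the `!` command above (the paths are specific to this AI Studio setup):

```python
# Minimal sketch: launch xtuner *without* the --deepspeed flag, mirroring the
# notebook command above. Paths are specific to this AI Studio environment.
import subprocess

CFG = ("/home/aistudio/llava/"
       "llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py")

subprocess.run(["/home/aistudio/.local/bin/xtuner", "train", CFG], check=True)
```
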
liuwake commented 1 month ago

I ran into the same problem and solved it by changing the training launch command.

  • I hit exactly the same error: 07/29 22:44:31 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized. It even got stuck there for 50 minutes without moving, with no GPU or VRAM usage at all.
liuwake commented 1 month ago

Debugging succeeded: I added the ninja directory to PATH, DeepSpeed is now usable, and the problem is completely solved.

So far it does not look like a lack of GPU memory; I can start DeepSpeed training normally on two T4 cards. I suspect a problem with the DeepSpeed installation. Could you try running the ds_report command to check for errors? If that command reports everything as normal, try running one of the official DeepSpeed example scripts, such as DeepSpeed_CIFAR, to verify that DeepSpeed can start properly.
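
A minimal sketch of those checks run from a notebook (assuming DeepSpeed and xtuner are installed under the user site, as above): `ds_report` is DeepSpeed's own environment report, and `ninja` has to be discoverable on PATH for DeepSpeed's JIT-compiled ops to build.

```python
# Sketch of the suggested sanity checks: confirm ninja is on PATH and run
# DeepSpeed's environment report. The PATH entry mirrors the setup above.
import os
import shutil
import subprocess

os.environ["PATH"] += ":/home/aistudio/.local/bin"

print("ninja found at:", shutil.which("ninja"))  # None => DeepSpeed's JIT ops cannot build
subprocess.run(["ds_report"], check=False)       # prints op compatibility and CUDA/torch info
```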

  • I tested that official check script and found that ninja was not being recognized. After I added the following in Jupyter:
    import os
    os.environ['PATH'] += ':/home/aistudio/.local/bin'
    # for ninja
    os.environ['PATH'] += ':/home/aistudio/.local/lib/python3.10/site-packages/ninja/data/bin'

    training with DeepSpeed worked perfectly! The command was `!/home/aistudio/.local/bin/xtuner train /home/aistudio/llava/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py --deepspeed deepspeed_zero2`. Here are a few lines of output from around the point where the ETA starts being printed:

    You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
    07/30 13:58:09 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
    07/30 13:58:09 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
    07/30 13:58:09 - mmengine - INFO - Checkpoints will be saved to /home/aistudio/llava/work_dirs/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.
    [2024-07-30 13:58:10,244] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
    [2024-07-30 13:58:10,904] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
    07/30 13:58:17 - mmengine - INFO - Iter(train) [  10/1200]  lr: 5.1430e-05  eta: 0:15:16  time: 0.7705  data_time: 0.0138  memory: 18114  loss: 1.3578
    07/30 13:58:24 - mmengine - INFO - Iter(train) [  20/1200]  lr: 1.0857e-04  eta: 0:15:06  time: 0.7654  data_time: 0.0144  memory: 18113  loss: 0.5315
    07/30 13:58:32 - mmengine - INFO - Iter(train) [  30/1200]  lr: 1.6571e-04  eta: 0:15:00  time: 0.7736  data_time: 0.0147  memory: 18113  loss: 0.3846
    07/30 13:58:39 - mmengine - INFO - Iter(train) [  40/1200]  lr: 2.0000e-04  eta: 0:14:42  time: 0.7338  data_time: 0.0147  memory: 18113  loss: 0.5993
    07/30 13:58:47 - mmengine - INFO - Iter(train) [  50/1200]  lr: 1.9994e-04  eta: 0:14:22  time: 0.7087  data_time: 0.0144  memory: 18113  loss: 0.5120
    07/30 13:58:54 - mmengine - INFO - Iter(train) [  60/1200]  lr: 1.9981e-04  eta: 0:14:20  time: 0.7770  data_time: 0.0145  memory: 18113  loss: 0.2459
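
Putting the fix together, here is a consolidated sketch of the working notebook cell described in this thread: the PATH additions for the user-level binaries and the pip-installed ninja, followed by the DeepSpeed launch. All paths are specific to this AI Studio environment.

```python
# Consolidated sketch of the working setup from this thread. Adjust the paths
# to your own installation; they are specific to this AI Studio environment.
import os
import subprocess

# Make the user-level entry points (xtuner, deepspeed) and the pip-installed
# ninja binary visible to subprocesses launched from the notebook.
os.environ["PATH"] += ":/home/aistudio/.local/bin"
os.environ["PATH"] += ":/home/aistudio/.local/lib/python3.10/site-packages/ninja/data/bin"

CFG = ("/home/aistudio/llava/"
       "llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_copy.py")

subprocess.run(
    ["/home/aistudio/.local/bin/xtuner", "train", CFG, "--deepspeed", "deepspeed_zero2"],
    check=True,
)
```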