InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Fine-tuning the 1.8B model raises KeyError: 'text' #442

Closed. zhanghui-china closed this issue 6 months ago.

zhanghui-china commented 6 months ago

Tried with the latest xtuner 0.1.15dev0 as well as xtuner 0.1.13.

I was not sure which config to pick for fine-tuning the 1.8B model, so I reused the earlier one: xtuner copy-cfg internlm2_chat_7b_qlora_oasst1_e3 .

The error output is as follows:

```
(xtuner0305) zhanghui@zhanghui:~/shishen18$ xtuner train ./internlm2_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2
[2024-03-04 23:12:07,542] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-04 23:12:10,204] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
03/04 23:12:11 - mmengine - INFO -

System environment:
    sys.platform: linux
    Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 478825338
    GPU 0: NVIDIA GeForce RTX 4090
    GPU 1: NVIDIA GeForce RTX 3080 Laptop GPU
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.6, V11.6.124
    GCC: gcc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
    PyTorch: 2.2.1+cu121
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    launcher: none
    randomness: {'seed': None, 'deterministic': False}
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    deterministic: False
    Distributed launcher: none
    Distributed training: False
    GPU number: 1

03/04 23:12:11 - mmengine - INFO - Config:
SYSTEM = ''
accumulative_counts = 16
batch_size = 1
betas = ( 0.9, 0.999, )
custom_hooks = [ dict( tokenizer=dict( padding_side='right', pretrained_model_name_or_path='/home/zhanghui/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.engine.DatasetInfoHook'), dict( evaluation_inputs=[ '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai', ], every_n_iters=500, prompt_template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat', system='', tokenizer=dict( padding_side='right', pretrained_model_name_or_path='/home/zhanghui/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.engine.EvaluateChatHook'), ]
data_path = './dataset/tran_dataset_0.json'
dataloader_num_workers = 0
default_hooks = dict( checkpoint=dict(interval=1, type='mmengine.hooks.CheckpointHook'), logger=dict(interval=10, type='mmengine.hooks.LoggerHook'), param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'), sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'), timer=dict(type='mmengine.hooks.IterTimerHook'))
env_cfg = dict( cudnn_benchmark=False, dist_cfg=dict(backend='nccl'), mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
evaluation_freq = 500
evaluation_inputs = [ '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai', ]
launcher = 'none'
load_from = None
log_level = 'INFO'
lr = 0.0002
max_epochs = 1
max_length = 2048
max_norm = 1
model = dict( llm=dict( pretrained_model_name_or_path='/home/zhanghui/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b', quantization_config=dict( bnb_4bit_compute_dtype='torch.float16', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, llm_int8_has_fp16_weight=False, llm_int8_threshold=6.0, load_in_4bit=True, load_in_8bit=False, type='transformers.BitsAndBytesConfig'), torch_dtype='torch.float16', trust_remote_code=True, type='transformers.AutoModelForCausalLM.from_pretrained'), lora=dict( bias='none', lora_alpha=16, lora_dropout=0.1, r=64, task_type='CAUSAL_LM', type='peft.LoraConfig'), type='xtuner.model.SupervisedFinetune')
optim_type = 'torch.optim.AdamW'
optim_wrapper = dict( optimizer=dict( betas=( 0.9, 0.999, ), lr=0.0002, type='torch.optim.AdamW', weight_decay=0), type='DeepSpeedOptimWrapper')
pack_to_max_length = True
param_scheduler = [ dict( begin=0, by_epoch=True, convert_to_iter_based=True, end=0.03, start_factor=1e-05, type='mmengine.optim.LinearLR'), dict( T_max=1, begin=0.03, by_epoch=True, convert_to_iter_based=True, eta_min=0.0, type='mmengine.optim.CosineAnnealingLR'), ]
pretrained_model_name_or_path = '/home/zhanghui/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b'
prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.internlm2_chat'
randomness = dict(deterministic=False, seed=None)
resume = False
runner_type = 'FlexibleRunner'
strategy = dict( config=dict( bf16=dict(enabled=True), fp16=dict(enabled=False, initial_scale_power=16), gradient_accumulation_steps='auto', gradient_clipping='auto', train_micro_batch_size_per_gpu='auto', zero_allow_untested_optimizer=True, zero_force_ds_cpu_optimizer=False, zero_optimization=dict(overlap_comm=True, stage=2)), exclude_frozen_parameters=True, gradient_accumulation_steps=16, gradient_clipping=1, train_micro_batch_size_per_gpu=1, type='xtuner.engine.DeepSpeedStrategy')
tokenizer = dict( padding_side='right', pretrained_model_name_or_path='/home/zhanghui/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained')
train_cfg = dict(by_epoch=True, max_epochs=1, val_interval=1)
train_dataloader = dict( batch_size=1, collate_fn=dict(type='xtuner.dataset.collate_fns.default_collate_fn'), dataset=dict( dataset=dict( data_files=dict(train='./dataset/tran_dataset_0.json'), path='json', type='datasets.load_dataset'), dataset_map_fn='xtuner.dataset.map_fns.oasst1_map_fn', max_length=2048, pack_to_max_length=True, remove_unused_columns=True, shuffle_before_pack=True, template_map_fn=dict( template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat', type='xtuner.dataset.map_fns.template_map_fn_factory'), tokenizer=dict( padding_side='right', pretrained_model_name_or_path='/home/zhanghui/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.dataset.process_hf_dataset'), num_workers=0, sampler=dict(shuffle=True, type='mmengine.dataset.DefaultSampler'))
train_dataset = dict( dataset=dict( data_files=dict(train='./dataset/tran_dataset_0.json'), path='json', type='datasets.load_dataset'), dataset_map_fn='xtuner.dataset.map_fns.oasst1_map_fn', max_length=2048, pack_to_max_length=True, remove_unused_columns=True, shuffle_before_pack=True, template_map_fn=dict( template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat', type='xtuner.dataset.map_fns.template_map_fn_factory'), tokenizer=dict( padding_side='right', pretrained_model_name_or_path='/home/zhanghui/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b', trust_remote_code=True, type='transformers.AutoTokenizer.from_pretrained'), type='xtuner.dataset.process_hf_dataset')
visualizer = None
warmup_ratio = 0.03
weight_decay = 0
work_dir = './work_dirs/internlm2_chat_7b_qlora_oasst1_e3_copy'

03/04 23:12:11 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
03/04 23:12:12 - mmengine - INFO - Hooks will be executed in the following order:
before_run: (VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook


before_train: (VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DatasetInfoHook
(LOW ) EvaluateChatHook
(VERY_LOW ) CheckpointHook


before_train_epoch: (VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook


before_train_iter: (VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook


after_train_iter: (VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(LOW ) EvaluateChatHook
(VERY_LOW ) CheckpointHook


after_train_epoch: (NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook


before_val: (VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook


before_val_epoch: (NORMAL ) IterTimerHook


before_val_iter: (NORMAL ) IterTimerHook


after_val_iter: (NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook


after_val_epoch: (VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook


after_val: (VERY_HIGH ) RuntimeInfoHook
(LOW ) EvaluateChatHook


after_train: (VERY_HIGH ) RuntimeInfoHook
(LOW ) EvaluateChatHook
(VERY_LOW ) CheckpointHook


before_test: (VERY_HIGH ) RuntimeInfoHook
(NORMAL ) DatasetInfoHook


before_test_epoch: (NORMAL ) IterTimerHook


before_test_iter: (NORMAL ) IterTimerHook


after_test_iter: (NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook


after_test_epoch: (VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook


after_test: (VERY_HIGH ) RuntimeInfoHook


after_run: (BELOW_NORMAL) LoggerHook


Generating train split: 100000 examples [00:01, 89208.28 examples/s]
Map (num_proc=32): 0%| | 0/100000 [00:00<?, ? examples/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 623, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3458, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/zhanghui/xtuner0305/xtuner/xtuner/dataset/map_fns/dataset_map_fns/oasst1_map_fn.py", line 22, in oasst1_map_fn
    for sentence in example['text'].strip().split('###'):
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 270, in __getitem__
    value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zhanghui/xtuner0305/xtuner/xtuner/tools/train.py", line 307, in <module>
    main()
  File "/home/zhanghui/xtuner0305/xtuner/xtuner/tools/train.py", line 303, in main
    runner.train()
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
    self._train_loop = self.build_train_loop(
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 965, in build_train_loop
    loop = EpochBasedTrainLoop(
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/mmengine/runner/loops.py", line 44, in __init__
    super().__init__(runner, dataloader)
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/home/zhanghui/xtuner0305/xtuner/xtuner/dataset/huggingface.py", line 299, in process_hf_dataset
    return process(**kwargs)
  File "/home/zhanghui/xtuner0305/xtuner/xtuner/dataset/huggingface.py", line 179, in process
    dataset = map_dataset(dataset, dataset_map_fn, map_num_proc)
  File "/home/zhanghui/xtuner0305/xtuner/xtuner/dataset/huggingface.py", line 50, in map_dataset
    dataset = dataset.map(dataset_map_fn, num_proc=map_num_proc)
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 663, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 663, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/zhanghui/anaconda3/envs/xtuner0305/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
KeyError: 'text'
(xtuner0305) zhanghui@zhanghui:~/shishen18$
```

LZHgrla commented 6 months ago
for sentence in example['text'].strip().split('###'):

The format of the dataset you are using does not match the map_fn you are using.
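
To make the mismatch concrete, here is a minimal sketch (the sample contents and the custom field names are illustrative, not taken from the actual dataset): oasst1_map_fn indexes example['text'] and splits it on '###', so it only works for oasst1-style data, while a custom instruction-tuning JSON usually has no 'text' column, which is exactly what the KeyError reports.

```python
# Minimal illustration of why oasst1_map_fn raises KeyError: 'text'.
# Both sample dicts below are made up for demonstration.

oasst1_style_sample = {
    "text": "### Human: Please introduce Shanghai### Assistant: Shanghai is ..."
}

custom_json_sample = {
    "conversation": [{"input": "Please introduce Shanghai", "output": "Shanghai is ..."}]
}

def split_like_oasst1_map_fn(example):
    # Mirrors the failing line shown in the traceback (oasst1_map_fn.py, line 22):
    # it assumes every sample carries a 'text' field.
    return example["text"].strip().split("###")

print(split_like_oasst1_map_fn(oasst1_style_sample))   # works

try:
    split_like_oasst1_map_fn(custom_json_sample)
except KeyError as e:
    print("KeyError:", e)   # KeyError: 'text', same as in the log above
```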

zhanghui-china commented 6 months ago

Oh, got it. I missed changing one thing: dataset_map_fn=None,
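
For completeness, a hedged sketch of what the relevant part of the copied config looks like after that change, written in the same string-type style as the dump above. It assumes the custom JSON already carries the conversation fields xtuner's process_hf_dataset expects (typically a 'conversation' list of input/output turns), so only the template map_fn remains:

```python
# Sketch of the corrected dataset section (only the keys relevant to the fix).
# 'tokenizer' is assumed to be defined earlier in the config, as in the dump above.
train_dataset = dict(
    type='xtuner.dataset.process_hf_dataset',
    dataset=dict(
        type='datasets.load_dataset',
        path='json',
        data_files=dict(train='./dataset/tran_dataset_0.json')),
    tokenizer=tokenizer,
    max_length=2048,
    dataset_map_fn=None,  # the JSON is not oasst1-formatted, so do not use oasst1_map_fn
    template_map_fn=dict(
        type='xtuner.dataset.map_fns.template_map_fn_factory',
        template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat'),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=True)
```

With dataset_map_fn=None, the raw samples are handed straight to the template map_fn, so each record in tran_dataset_0.json has to provide those conversation fields itself.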