Closed. danyow-cheung closed this 4 months ago.
Ran into the same problem. It seems that all of a sudden both my fine-tuned model and the official model turned into morons.
In particular, this is the result from the official web_demo.py, so I can't figure out the reason.
Drives me crazy.
Did anyone find a solution to this?
Same problem here: after fine-tuning, I also don't get any output.
I'm encountering the same problem after LoRA fine-tuning. Are you running inference on your fine-tuned model with the infrastructure from chat.py, or are you loading the fine-tuned model manually with AutoPeftModelForCausalLM? If the latter, could you share your inference code?
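By "loading manually" I mean something along these lines (a sketch; the checkpoint path is a placeholder for your own output directory):

```python
# Load a LoRA fine-tuned checkpoint directly, outside web_demo / chat.py.
# Sketch only: the path is a placeholder, and MiniCPM-V needs trust_remote_code.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ckpt = "output/output_minicpmv2"  # placeholder for your checkpoint dir
model = AutoPeftModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
```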
I use web_demo_2.5.py for inference and just change model_path to the output checkpoint; I don't modify anything else.
As far as I can work out, it seems that the hf-ds training code is problematic. I guess that is the reason why ds.sh is called "simple ft"...
I guess the problem is in trainer.py, which should be:
```python
if labels is not None:
    labels = labels.to(lm_logits.device)
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
    )
```
NOT
```python
if labels is not None:
    # Flatten the tokens
    loss_fct = nn.CrossEntropyLoss()
    logits = outputs.logits.view(-1, self.model.config.vocab_size).contiguous()
    labels = labels.view(-1).long().contiguous()
    # Enable model parallelism
    labels = labels.to(logits.device)
    loss = loss_fct(logits, labels)
```
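For anyone comparing the two variants side by side, here is a minimal runnable sketch of both computations (the shapes are illustrative assumptions; which variant is correct depends on whether the data pipeline has already shifted the labels, which is settled later in this thread):

```python
# Compare the "shift inside the loss" and "no shift" variants on toy data.
import torch
from torch.nn import CrossEntropyLoss

batch, seq, vocab = 2, 5, 11
lm_logits = torch.randn(batch, seq, vocab)
labels = torch.randint(0, vocab, (batch, seq))

# Variant 1: shift inside the loss (labels aligned with the inputs).
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss_shifted = CrossEntropyLoss()(
    shift_logits.view(-1, vocab), shift_labels.view(-1)
)

# Variant 2: no shift (assumes the collator already shifted the labels).
loss_flat = CrossEntropyLoss()(lm_logits.view(-1, vocab), labels.view(-1).long())

print(loss_shifted.item(), loss_flat.item())
```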
Hope my judgement is correct. The official code failing is quite disappointing... Now I am trying to fix the whole thing; wait for my success...
good luck !
Is there an existing issue / discussion for this?
- [x] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?
- [x] I have searched FAQ

Current Behavior
I currently use the default command to fine-tune the model. The command info is:

```shell
(llava) hs@hs-System-Product-Name:/media/hs/DATA/ubuntu/code/MiniCPM-V/finetune$ python finetune.py --model_name_or_path dia/hs/DATA/ubuntu/code/pokemon-blip-captions-en-zh/json_files/train.json --eval_data_path /media/hs/DATA/ubuntu/code/pd_columns false --label_names "labels" --prediction_loss_only false --bf16 true --bf16_full_eval true _eval --tune_vision false --tune_llm false --model_max_length 512 --max_slice_nums 1 --scale_resolutiodir output/output_minicpmv2 --logging_dir output/output_minicpmv2 --logging_strategy "steps" --per_device_trait_accumulation_steps 1 --evaluation_strategy "steps" --save_strategy "steps" --save_steps 1000 --save_tota--adam_beta2 0.95 --warmup_ratio 0.01 --lr_scheduler_type "cosine" --logging_steps 1 --gradient_checkpointtensorboard"
```
TensorBoard also looked correct.
Expected Behavior
During the test phase, I pointed model_name_or_path at the output_minicpmv2 path.
Something went wrong: the model stopped responding or gave empty responses.
```
--------
<User>: what is the image about
<Assistant>: a before and after situation
<User>: what is the image about
<Assistant>: pokemon
<User>: anything more ?
<Assistant>:
<User>: what is the color of the imaeg
<Assistant>:
<User>: what is the image about
<Assistant>:
<User>: what is the image about
<Assistant>:
<User>: what is the image about
<Assistant>: a ball with an open mouth
<User>: what is the color
<Assistant>:
<User>: ugh, you failed again
<Assistant>:
<User>: summarize the information in the image
<Assistant>:
-----------
<User>: what information is in the image
<Assistant>:
<User>: what is the image about ?
<Assistant>:
<User>: what is the image about ?
<Assistant>: 4
<User>: what is the info
<Assistant>:
```
Steps To Reproduce
I think the main problem comes from my train.json file, so here are some samples:

```json
[
  {"id": 708, "image": "/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/708.png", "conversations": [{"role": "user", "content": "<image>\n What is the image about ?"}, {"role": "assistant", "content": "a red bird with black wings flying through the air"}]},
  {"id": 709, "image": "/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/709.png", "conversations": [{"role": "user", "content": "<image>\n What is the image about ?"}, {"role": "assistant", "content": "a cartoon character flying through the air"}]},
  {"id": 710, "image": "/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/710.png", "conversations": [{"role": "user", "content": "<image>\n What is the image about ?"}, {"role": "assistant", "content": "a drawing of a cartoon character with eyes and a nose"}]},
  {"id": 711, "image": "/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/711.png", "conversations": [{"role": "user", "content": "<image>\n What is the image about ?"}, {"role": "assistant", "content": "a picture of a butterfly made out of paper"}]}
]
```
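To sanity-check the file, something like the sketch below could help (field names are taken from the samples above; the file path and the image-existence check are assumptions about my local setup):

```python
# Structural check for train.json entries: required keys, reachable images,
# and a <image> tag in the first user turn.
import json
import os

with open("train.json", encoding="utf-8") as f:
    data = json.load(f)

for item in data:
    assert {"id", "image", "conversations"} <= set(item), item.get("id")
    assert os.path.exists(item["image"]), item["image"]
    first = item["conversations"][0]
    assert first["role"] == "user" and "<image>" in first["content"], item["id"]

print(f"{len(data)} samples look structurally OK")
```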
I used the text-to-image dataset pokemon-blip-captions-en-zh and simply restructured my JSON file.

Environment
- OS: Ubuntu 23.10
- Python: 3.10.14
- Transformers: 4.40.0
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
Anything else?
No response
Could I ask which DeepSpeed version you used for fine-tuning, and which CUDA version? I'm using CUDA 11.8, with everything else installed strictly per requirements.txt, but DeepSpeed always errors out no matter which version I try:

```
RuntimeError: Error building extension 'fused_adam'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9851) of binary: /public/home/user/miniconda3/envs/llm/bin/python
```

The detailed error message is as follows:
```
Detected CUDA files, patching ldflags
Emitting ninja build file /public/home/user/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Traceback (most recent call last):
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/public/home/user/lzujqwang/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 323, in <module>
    train()
  File "/public/home/user/lzujqwang/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 313, in train
    trainer.train()
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2015, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
    return self.jit_load(verbose)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
    op_module = load(name=self.name,
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
```

The other ranks print `Loading extension module fused_adam...` and then fail with the same traceback up to `_jit_compile`, ending in an ImportError because the shared object was never built:

```
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /public/home/user/.cache/torch_extensions/py310_cu118/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
```

torchrun then reports the aggregate failure:

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9851) of binary: /public/home/user/miniconda3/envs/llm/bin/python
Traceback (most recent call last):
  File "/public/home/user/miniconda3/envs/llm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-07_12:45:07
  host      : gpu008.cluster.cn
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 9852)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-07_12:45:07
  host      : gpu008.cluster.cn
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 9853)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-07_12:45:07
  host      : gpu008.cluster.cn
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 9854)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-07_12:45:07
  host      : gpu008.cluster.cn
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9851)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
@JinQiangWang2021
You need to change the ninja invocation that fails with

```
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
```

from `['ninja', '-v']` to `['ninja', '--version']`.
My DeepSpeed version is 0.14.2.
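Before patching anything, it may also help to confirm that ninja is actually installed and on PATH (a quick sanity-check sketch, not the fix itself; torch's cpp_extension uses ninja to JIT-build fused_adam):

```python
# Check whether ninja is reachable and reports a version.
import shutil
import subprocess

if shutil.which("ninja") is None:
    print("ninja is not on PATH -- install it, e.g. `pip install ninja`")
else:
    out = subprocess.run(["ninja", "--version"], capture_output=True, text=True)
    print("ninja version:", out.stdout.strip() or out.stderr.strip())
```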
> As far as I have worked out, it seems that the hf-ds training code is problematic... I guess the problem is in trainer.py [...]
Unfortunately, I was wrong again. The data is already shifted... So frustrating.
I ran into this problem too. Check whether the "chat_template" in the tokenizer_config.json saved after training matches the original model's (the end should have '<|start_header_id|>assistant<|end_header_id|>\n\n' appended).
> I ran into this problem too. Check whether the "chat_template" in the tokenizer_config.json saved after training matches the original model's (the end should have '<|start_header_id|>assistant<|end_header_id|>\n\n' appended).

This is the actual problem, I think, because the fine-tuned config doesn't have '<|start_header_id|>assistant<|end_header_id|>\n\n' at the end. Adding it manually at least solves the empty-response problem. It would be helpful if someone could explain why that part is not there by default in tokenizer_config.json.
Let me describe my situation; I also get empty responses. The 'chat_template' in tokenizer_config.json after training is:

```
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}"
```

The 'chat_template' in the original model's tokenizer_config.json is:

```
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
```

The two are indeed different, but I'm not sure whether that is the main cause of the problem.

Update: after appending {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }} to the end of the 'chat_template' in the generated tokenizer_config.json, the empty-output problem is solved for me.
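In script form, the edit looks roughly like this (the checkpoint path is a placeholder for your own output directory):

```python
# Append the assistant header to the saved chat template so inference
# matches the original model's template. Path is a placeholder.
import json

path = "output/output_minicpmv2/tokenizer_config.json"
suffix = "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"

with open(path, encoding="utf-8") as f:
    cfg = json.load(f)

if not cfg["chat_template"].endswith(suffix):
    cfg["chat_template"] += suffix
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, ensure_ascii=False, indent=2)
    print("chat_template patched")
else:
    print("chat_template already ends with the assistant header")
```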
> I ran into this problem too. Check whether the "chat_template" in the saved tokenizer_config.json matches the original model's...

Because the chat template is hard-coded in finetune.py, the chat template in tokenizer_config.json gets overwritten.
Having looked at the code, this chat template is indeed what causes the abnormal inference. The chat template in the original MiniCPM-V model config is the one used at inference time; it appends the assistant header at the end, which is fine. But during fine-tuning, finetune.py overwrites it with a version without the assistant header (which indeed should not be there during training), so the saved config also ends up without it. Inference is then inconsistent with training.
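To make the mismatch concrete, here is a small standalone sketch that renders the same message with the training template and the inference template (the templates are copied from the configs quoted above; bos_token is supplied by hand):

```python
# Render one user turn with and without the trailing assistant header.
# Without it, generation does not start inside an assistant turn.
from jinja2 import Template

training_tpl = (
    "{% set loop_messages = messages %}{% for message in loop_messages %}"
    "{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'"
    "+ message['content'] | trim + '<|eot_id|>' %}"
    "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
    "{{ content }}{% endfor %}"
)
inference_tpl = training_tpl + "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"

messages = [{"role": "user", "content": "what is the image about?"}]
for name, tpl in [("training", training_tpl), ("inference", inference_tpl)]:
    print(name, repr(Template(tpl).render(messages=messages, bos_token="<s>")))
```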
Yes, this is exactly the problem.
Hello, we have updated the code and fixed this template bug; please try our code again.
@qyc-98 Thanks for the reply. I have already updated to the latest version, but I am still puzzled by the behavior described in the original issue above.
Hello, we have updated the fine-tuning training code as well as the model code. Please try again; when I tested it here, the output was normal.