OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
Apache License 2.0
7.97k stars 558 forks

[BUG] after finetune, model inference is None / empty #220

Closed danyow-cheung closed 17 hours ago

danyow-cheung commented 1 month ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

I currently use the default command to fine-tune the model. The command is:

(llava) hs@hs-System-Product-Name:/media/hs/DATA/ubuntu/code/MiniCPM-V/finetune$ python finetune.py --model_name_or_path [...] --data_path /media/hs/DATA/ubuntu/code/pokemon-blip-captions-en-zh/json_files/train.json --eval_data_path /media/hs/DATA/ubuntu/code/[...] --remove_unused_columns false --label_names "labels" --prediction_loss_only false --bf16 true --bf16_full_eval true [...] --do_eval --tune_vision false --tune_llm false --model_max_length 512 --max_slice_nums 1 --scale_resolution [...] --output_dir output/output_minicpmv2 --logging_dir output/output_minicpmv2 --logging_strategy "steps" --per_device_train_batch_size [...] --gradient_accumulation_steps 1 --evaluation_strategy "steps" --save_strategy "steps" --save_steps 1000 --save_total_limit [...] --adam_beta2 0.95 --warmup_ratio 0.01 --lr_scheduler_type "cosine" --logging_steps 1 --gradient_checkpointing [...] --report_to "tensorboard"

(parts of the command were lost when pasting; [...] marks the gaps)

The TensorBoard curves also looked correct. (screenshot)

Expected Behavior

During the test phase, I replaced model_name_or_path with the output_minicpmv2 path.

Something went wrong: the model stopped responding or gave empty responses.

--------
<User>: what is the image about 
<Assistant>: a before and after situation
<User>: what is the image about 
<Assistant>: pokemon
<User>: anything more ? 
<Assistant>: 
<User>: what is the color of the imaeg
<Assistant>: 
<User>: what is the image about

<Assistant>: 
<User>: what is the image about
<Assistant>: 
<User>: what is the image about
<Assistant>: a ball with an open mouth
<User>: what is the color 
<Assistant>: 
<User>: 熬 你又不行了 (ugh, you're failing again)
<Assistant>: 
<User>: 总结一下图片中的信息 (summarize the information in the image)
<Assistant>: 

-----------
<User>: 图片中有什么信息 (what information is in the image)
<Assistant>: 
<User>: what is the image about ? 
<Assistant>: 
<User>: what is the image about ? 
<Assistant>: 4
<User>: what is the info 
<Assistant>: 

Steps To Reproduce

I think the main problem comes from my train.json file, so here are some samples:

[{"id": 708, "image": "/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/708.png", "conversations": [{"role": "user", "content": "<image>\n What is the image about ?"}, {"role": "assistant", "content": "a red bird with black wings flying through the air"}]}, {"id": 709, "image": "/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/709.png", "conversations": [{"role": "user", "content": "<image>\n What is the image about ?"}, {"role": "assistant", "content": "a cartoon character flying through the air"}]}, {"id": 710, "image": "/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/710.png", "conversations": [{"role": "user", "content": "<image>\n What is the image about ?"}, {"role": "assistant", "content": "a drawing of a cartoon character with eyes and a nose"}]}, {"id": 711, "image": "/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/711.png", "conversations": [{"role": "user", "content": "<image>\n What is the image about ?"}, {"role": "assistant", "content": "a picture of a butterfly made out of paper"}]},]

I used the text-to-image dataset pokemon-blip-captions-en-zh and simply restructured it into this JSON format, roughly as sketched below.
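(A minimal sketch of how such a train.json can be assembled; the captions below are copied from the samples above, and the paths follow the same pattern — the real script would iterate over the whole dataset.)

    import json

    # Stand-in captions keyed by image id; in practice these come from the
    # pokemon-blip-captions-en-zh caption data.
    captions = {
        708: "a red bird with black wings flying through the air",
        709: "a cartoon character flying through the air",
    }

    records = []
    for idx, caption in captions.items():
        records.append({
            "id": idx,
            "image": f"/home/hs/common/code/pokemon-blip-captions-en-zh/raw_data/images/{idx}.png",
            "conversations": [
                {"role": "user", "content": "<image>\nWhat is the image about ?"},
                {"role": "assistant", "content": caption},
            ],
        })

    with open("train.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)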

Environment

- OS:            Ubuntu 23.10
- Python:        3.10.14
- Transformers:  4.40.0 
- PyTorch:       2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

No response

JasonLeeFdu commented 1 month ago

I've run into the same problem. It seems that all of a sudden both my fine-tuned model and the official model turned into morons.

In particular, this is the result from the official web_demo.py; I can't figure out the reason.

(screenshot)

JasonLeeFdu commented 1 month ago

This is driving me crazy.

dhruvil237 commented 1 month ago

Did anyone find a solution to this?

zhiweihu1103 commented 1 month ago

Same problem: after fine-tuning, I also don't get any output.

creiglas-lgai commented 1 month ago

I'm encountering the same problem after LoRA fine-tuning. Are you running inference on your fine-tuned model with the infrastructure from chat.py, or are you loading it manually with AutoPeftModelForCausalLM? If the latter, could you share your inference code?

zhiweihu1103 commented 1 month ago

I use web_demo_2.5.py for inference and directly change model_path to the output checkpoint, without modifying anything else.
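(For anyone loading the checkpoint manually instead of going through web_demo_2.5.py: a minimal sketch following the chat interface shown in the repo README; the checkpoint path is this thread's output directory and the image path is a placeholder.)

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    ckpt = "output/output_minicpmv2"  # fine-tuned checkpoint from this thread
    model = AutoModel.from_pretrained(ckpt, trust_remote_code=True,
                                      torch_dtype=torch.float16).to("cuda").eval()
    tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

    image = Image.open("708.png").convert("RGB")  # placeholder test image
    msgs = [{"role": "user", "content": "What is the image about ?"}]

    # chat() as documented in the MiniCPM-V README
    res = model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                     sampling=True, temperature=0.7)
    print(res)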

JasonLeeFdu commented 1 month ago

As far as I have worked out, it seems that the hf-ds training code is problematic. I guess that is why ds.sh is called "simple ft"...

I guess the problem is in trainer.py, which should look like this:

    if labels is not None:
        labels = labels.to(lm_logits.device)
        shift_logits = lm_logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(
            shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
        )

and NOT like this:


    if labels is not None:
        # Flatten the tokens
        loss_fct = nn.CrossEntropyLoss()
        logits = outputs.logits.view(-1, self.model.config.vocab_size).contiguous()
        labels = labels.view(-1).long().contiguous()
        # Enable model parallelism
        labels = labels.to(logits.device)
        loss = loss_fct(logits, labels)

Hope my judgement is correct. Official code failure is quite disappointing... Now I am trying to fix the whole damn thing; wait for my success...
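(To make the two conventions concrete, a toy illustration; the shapes are made up and this is not the model's actual code.)

    import torch
    from torch.nn import CrossEntropyLoss

    # Toy batch: 1 sequence of length 4 over a vocab of 5 tokens.
    lm_logits = torch.randn(1, 4, 5)
    labels = torch.tensor([[2, 3, 1, 4]])

    # Shifted convention: the logits at position t are scored against the
    # token at position t+1 (next-token prediction).
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    shifted = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                       shift_labels.view(-1))

    # Unshifted convention: position t is scored against position t. This is
    # only equivalent if the dataloader has already shifted the labels by one.
    unshifted = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
    print(shifted.item(), unshifted.item())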

danyow-cheung commented 1 month ago

> As far as I have worked out, it seems that the hf-ds training code is problematic. [...]

Good luck!

JinQiangWang2021 commented 1 month ago

> (quotes the full issue body above)

Could I ask which version of DeepSpeed you used for fine-tuning, and which CUDA? I'm on CUDA 11.8, with everything else installed strictly per requirements.txt, but DeepSpeed always errors out; I've tried every version and it still fails:

RuntimeError: Error building extension 'fused_adam'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9851) of binary: /public/home/user/miniconda3/envs/llm/bin/python

The detailed error message is as follows:

Detected CUDA files, patching ldflags
Emitting ninja build file /public/home/user/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Traceback (most recent call last):
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/public/home/user/lzujqwang/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 323, in <module>
    train()
  File "/public/home/user/lzujqwang/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 313, in train
    trainer.train()
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2015, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
    return self.jit_load(verbose)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
    op_module = load(name=self.name,
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

(The remaining ranks fail in parallel while trying to load the module that never got built; their interleaved tracebacks all end the same way:)

Loading extension module fused_adam...
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /public/home/user/.cache/torch_extensions/py310_cu118/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9851) of binary: /public/home/user/miniconda3/envs/llm/bin/python
Traceback (most recent call last):
  File "/public/home/user/miniconda3/envs/llm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/public/home/user/miniconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  ranks 1-3 (pids 9852-9854) failed identically at 2024-06-07_12:45:07 on gpu008.cluster.cn
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-07_12:45:07
  host      : gpu008.cluster.cn
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9851)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
danyow-cheung commented 1 month ago

@JinQiangWang2021
You need to change the command in

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

to ['ninja', '--version']. See the sketch below for where that invocation lives.

My DeepSpeed version is 0.14.2.
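(For context: that ['ninja', '-v'] call is not in MiniCPM-V itself but in PyTorch's extension builder. A sketch of the commonly shared local edit, assuming the torch 2.x layout from the traceback above:)

    # torch/utils/cpp_extension.py, inside _run_ninja_build (the frame at
    # cpp_extension.py:1893 in the traceback above) builds with:
    command = ['ninja', '-v']
    # Commonly shared workaround: swap the verbose flag for a version query
    # so the subprocess call stops failing. Note this sidesteps rather than
    # fixes the underlying toolchain problem, so also check that your
    # gcc/nvcc versions are compatible with CUDA 11.8.
    command = ['ninja', '--version']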

JasonLeeFdu commented 1 month ago

> As far as I have worked out, it seems that the hf-ds training code is problematic. [...]

Unfortunately, I was wrong again. The data is already shifted... How frustrating.

HuanLiuNLP commented 1 month ago

I ran into this problem here too. Check whether the "chat_template" in the tokenizer_config.json saved after training matches the original model's (it needs '<|start_header_id|>assistant<|end_header_id|>\n\n' appended at the end).

dhruvil237 commented 1 month ago

> I ran into this problem here too. Check whether the "chat_template" in the tokenizer_config.json saved after training matches the original model's (it needs '<|start_header_id|>assistant<|end_header_id|>\n\n' appended at the end).

This is the actual problem, I think, because the fine-tuned config doesn't have '<|start_header_id|>assistant<|end_header_id|>\n\n' at the end; adding it manually at least solves the empty-response problem. It would be helpful if someone could explain why that part is not there by default in the saved tokenizer_config.json.

zhiweihu1103 commented 1 month ago

Let me describe my situation, since I also get empty responses. The 'chat_template' in the tokenizer_config.json after training is:

"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}"

while the 'chat_template' in the original model's tokenizer_config.json is:

"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"

The two are indeed different, but I'm not sure whether that is the main cause of the problem.

zhiweihu1103 commented 1 month ago

Update: after appending {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }} to the end of the 'chat_template' in the generated tokenizer_config.json, I solved the empty-output problem.
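(A minimal sketch of applying that fix programmatically, assuming a transformers 4.40-style tokenizer that exposes chat_template; the checkpoint path is this thread's output directory.)

    from transformers import AutoTokenizer

    ckpt = "output/output_minicpmv2"  # fine-tuned checkpoint directory
    tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

    # Append the assistant header that the inference-time template expects.
    suffix = "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
    if not tok.chat_template.endswith(suffix):
        tok.chat_template += suffix
        tok.save_pretrained(ckpt)  # rewrites tokenizer_config.json in place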

HuanLiuNLP commented 1 month ago

> I ran into this problem here too. Check whether the "chat_template" in the tokenizer_config.json saved after training matches the original model's (it needs '<|start_header_id|>assistant<|end_header_id|>\n\n' appended at the end).

> This is the actual problem, I think, because the fine-tuned config doesn't have '<|start_header_id|>assistant<|end_header_id|>\n\n' at the end. [...]

Because the chat template is hard-coded in finetune.py, the chat template in tokenizer_config.json gets overwritten.

HuanLiuNLP commented 1 month ago

I looked at the code, and this chat template handling should indeed be what causes the abnormal inference. The chat template in the original MiniCPM-V model config is the one used at inference time; it appends the assistant header at the end, which is correct. But during fine-tuning, finetune.py overwrites it with a version without the assistant header (training indeed shouldn't include it), so the saved config also ends up without the assistant header. Inference is then inconsistent with training.

zhiweihu1103 commented 1 month ago

Yes, that is exactly the problem.

qyc-98 commented 1 month ago

Hi, we have updated the code and fixed this template bug. Please try our code again.

danyow-cheung commented 1 month ago

@qyc-98 Thanks for the reply; I have updated to the latest version. But the following behavior still puzzles me:

  1. After fine-tuning with English prompts, Chinese input produces no inference output.
  2. Apart from the English template prompt, other English inputs get no reply.

(screenshots)

qyc-98 commented 1 day ago

Hi, we have updated both the fine-tuning training code and the model code; please try again. In my tests the model still produces normal output.