QwenLM / Qwen-VL

The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud.

[BUG] After full fine-tuning, the loss is normal but inference output is garbled #472

Open Z-MU-Z opened 1 month ago

Z-MU-Z commented 1 month ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

    # Requires: from transformers import AutoModelForCausalLM, AutoTokenizer
    # Load the fully fine-tuned checkpoint and its tokenizer.
    self.model = AutoModelForCausalLM.from_pretrained(
        model_path, device_map='cuda', trust_remote_code=True).eval()

    self.tokenizer = AutoTokenizer.from_pretrained(model_path,
                                                   trust_remote_code=True)
    self.tokenizer.padding_side = 'left'
    self.tokenizer.pad_token_id = self.tokenizer.eod_id

    # Chat-format prompt: image placeholder plus a grounding query.
    self.prompt = '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <img>{}</img>\nPlease find the object. The object description is as follows:<ref>{}</ref><|im_end|>\n<|im_start|>assistant\n'

    # Tokenize the filled-in prompt and run greedy decoding.
    token_result = self.tokenizer([prompt], return_tensors='pt', padding='longest')
    input_ids = token_result.input_ids
    # print(self.tokenizer.decode(input_ids[0]))
    attention_mask = token_result.attention_mask
    pred = self.model.generate(
        input_ids=input_ids.cuda(),
        attention_mask=attention_mask.cuda(),
        do_sample=False,
        num_beams=1,
        max_new_tokens=28,
        min_new_tokens=10,
        length_penalty=1,
        num_return_sequences=1,
        use_cache=True,
        pad_token_id=self.tokenizer.eod_id,
        eos_token_id=self.tokenizer.eod_id,
        # masks_ids=mask_token
    )
    # Decode only the newly generated tokens, dropping the prompt prefix.
    answers = [
        self.tokenizer.decode(_[input_ids.size(1):].cpu(),
                              skip_special_tokens=True) for _ in pred
    ]

The model's predictions are all garbled.

Expected Behavior

The model should produce normal prediction output.

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

Z-MU-Z commented 1 month ago

Training was done with finetune_ds.sh; no other parameters were changed.

Z-MU-Z commented 1 month ago

The checkpoint directory after training looks like this: config.json, configuration_qwen.py, generation_config.json, model.safetensors, modeling_qwen.py, qwen_generation_utils.py, qwen.tiktoken, special_tokens_map.json, tokenization_qwen.py, tokenizer_config.json, trainer_state.json, training_args.bin, visual.py

whycantfindaname commented 1 month ago

Could I ask how much GPU memory full fine-tuning needs? On H20 we can only fit a batch size of 2.

Z-MU-Z commented 1 month ago

@whycantfindaname I set per_device_train_batch_size to 1.

whycantfindaname commented 1 month ago

We're using ZeRO stage 3, which should already be the most memory-efficient option. So it looks like full fine-tuning is still quite expensive in terms of resources.
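For reference, here is a minimal sketch of the memory-relevant parts of a ZeRO stage-3 DeepSpeed config, written as a Python dict. The keys are standard DeepSpeed options, but the concrete values are illustrative assumptions, not the settings shipped in the repo's own ds_config files or used in this thread:

```python
# Hedged sketch of a ZeRO stage-3 DeepSpeed config as a Python dict.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Optional CPU offload trades throughput for lower GPU memory.
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
        "overlap_comm": True,
        # Gather full 16-bit weights on rank 0 when saving the model.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
}
```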

Z-MU-Z commented 1 month ago

It looks like a model-saving problem. In theory four safetensors shards should be saved, but in the end only one was actually written; the intermediate checkpoints were saved correctly, though. I haven't found the cause yet. Does anyone know how to fix this?
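If it helps to verify whether a saved checkpoint is complete, here is a small hedged sketch (the ckpt_dir path is a placeholder) that lists the .safetensors shards and sums the parameters they actually contain, using the safetensors library:

```python
import os
from safetensors import safe_open

ckpt_dir = "output_qwen"  # placeholder path to the fine-tuned checkpoint
shards = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".safetensors"))
print("shards found:", shards)

# Sum the number of parameters stored across all shards.
total = 0
for shard in shards:
    with safe_open(os.path.join(ckpt_dir, shard), framework="pt") as f:
        for name in f.keys():
            total += f.get_tensor(name).numel()
print(f"parameters saved: {total / 1e9:.2f}B")
# If this is far below the full model size, weights were dropped at save time.
```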

Z-MU-Z commented 1 month ago

I found that this problem only appears when using ZeRO-2; with ZeRO-3 it is fine. Concretely, when the final model is saved at the end of training, it prints: Removed shared tensor {'transformer.h.27.mlp.w2.weight', 'transformer.h.3.mlp.w1.weight', 'transformer.h.13.mlp.w1.weight', 'transformer.h.18.attn.c_attn.bias', 'transformer.visual.attn_pool.pos_embed', 'transformer.h.5.ln_1.weight', ...
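For what it's worth, one workaround I've seen suggested for this "Removed shared tensor" path (not verified in this thread; the output directory name below is a placeholder) is to bypass safetensors serialization when saving the final weights, since the message comes from the safetensors save path deduplicating tensors it believes are shared:

```python
# Hedged workaround sketch, not verified in this thread.
# save_pretrained(..., safe_serialization=False) is a standard transformers API
# and writes pytorch_model*.bin shards instead of model.safetensors.
model.save_pretrained("output_qwen_full", safe_serialization=False)
tokenizer.save_pretrained("output_qwen_full")
```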

It seems related to the transformers version: after switching from transformers==4.37.2 back to transformers==4.32.0, saving works normally again.
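A quick way to confirm that the environment actually picked up the downgraded package:

```python
# Confirm which transformers version is being imported after the downgrade.
import transformers
print(transformers.__version__)  # expected: 4.32.0
```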