RupertLuo / Valley

The official repository of "Video assistant towards large language model makes everything easy"


How do I use a checkpoint obtained from fine-tuning for inference? #27

Closed qqqq-0 closed 9 months ago

RupertLuo commented 9 months ago

You should be able to load it directly by passing the checkpoint path as the model path.

qqqq-0 commented 9 months ago

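But the checkpoint folder produced by fine-tuning is only 1.9 GB, while the 7B pretrained_config I used for inference before is 14 GB. It looks like a lot of the weights are missing.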

RupertLuo commented 9 months ago

Those are the LoRA weights, right?

qqqq-0 commented 9 months ago

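Actually, for fine-tuning I set lora: False in the yaml. But with 8 A100s and batch_size set to 16, each GPU only used about 30 GB of memory during training. So I suspect the yaml setting did not take effect and training still used LoRA.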

RupertLuo commented 9 months ago

Post your yaml and I will take a look.

RupertLuo commented 9 months ago

This function prints the trainable parameters; if LoRA is enabled, it prints the LoRA parameters: print_trainable_params(model)
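For readers who want to reproduce this kind of summary outside the repository, here is a minimal sketch. It is not the repository's `print_trainable_params`; the names `summarize_trainable_params` and `format_summary` are hypothetical, and the code only assumes a PyTorch-style model whose `named_parameters()` yields parameters with `numel()` and `requires_grad`.

```python
def summarize_trainable_params(model):
    """Return (total, trainable, trainable_names) for a PyTorch-style model.

    Hypothetical helper: relies only on model.named_parameters() yielding
    (name, param) pairs where param has .numel() and .requires_grad,
    as torch.nn.Module does.
    """
    total = 0
    trainable = 0
    trainable_names = []
    for name, param in model.named_parameters():
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
            trainable_names.append(name)
    return total, trainable, trainable_names


def format_summary(total, trainable):
    """Format the counts in millions, mirroring the log line quoted in this thread."""
    return f"Total: {total / 1e6:.2f}M Trainable: {trainable / 1e6:.2f}M"
```

Calling `format_summary(*summarize_trainable_params(model)[:2])` on a loaded model gives a one-line summary, and the returned name list can be inspected to see which layers are actually being trained.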

qqqq-0 commented 9 months ago

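I found this in the log: Total: 974.36M Trainable: 540.07M. So it looks like LoRA was used? I created a new config of my own for fine-tuning and set lora: False in it, but in the original valley_stage2.yaml I had changed the setting to lora: True. Could the run still be taking the lora setting from valley_stage2.yaml?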

RupertLuo commented 9 months ago

Did you pass your own config via --conf when you started training?

qqqq-0 commented 9 months ago

Yes, I did. I used my own config.

qqqq-0 commented 9 months ago

This is how that part of my own config is set:

    project_name: valley
    run_name: valley_stage2
    # whether to mask only the system prompt in the labels and leave everything else unmasked
    only_mask_system: False
    # system prompt style
    conv_mode: v1
    # whether to freeze the backbone
    freeze_backbone: False
    # whether to tune the multimodal projection layer
    tune_mm_mlp_adapter: True
    # whether to use LoRA
    lora: False
    # whether the model is multimodal
    is_multimodal: True

RupertLuo commented 9 months ago

Check what was saved in your output_dir. Your config looks fine to me.

qqqq-0 commented 9 months ago

In output_dir the only model weight file is pytorch_model.bin, which is 1.9 GB. There are also several rng_state0-7.pth files and some .json files, each a few tens of KB.

RupertLuo commented 9 months ago

That does not look right. Run it again and show me what gets printed for the training parameters. It prints a table.

qqqq-0 commented 9 months ago

Do you mean the table printed in the log? This is what I see:

    +------------------------------------------------+------------------+
    | Parameter Name                                 | Max Layer Number |
    +------------------------------------------------+------------------+
    | model.layers.*.post_attention_layernorm.weight | 2                |
    | model.norm.weight                              | 1                |
    | model.mm_projector.bias                        | 1                |
    | model.embed_tokens.weight                      | 1                |
    | model.layers.*.self_attn.k_proj.weight         | 2                |
    | model.layers.*.self_attn.v_proj.weight         | 2                |
    | model.mm_projector.weight                      | 1                |
    | model.layers.*.self_attn.q_proj.weight         | 2                |
    | model.layers.*.self_attn.o_proj.weight         | 2                |
    | model.layers.*.mlp.gate_proj.weight            | 2                |
    | model.layers.*.mlp.down_proj.weight            | 2                |
    | model.layers.*.mlp.up_proj.weight              | 2                |
    | model.layers.*.input_layernorm.weight          | 2                |
    +------------------------------------------------+------------------+

    Total: 974.36M Trainable: 540.07M

RupertLuo commented 9 months ago

It looks like you only trained two layers.
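The sizes reported in this thread can be cross-checked with simple arithmetic. The sketch below assumes weights are stored at 2 bytes per parameter (fp16 or bf16); the thread does not state the dtype, so that is an assumption, and `estimate_checkpoint_gb` is a hypothetical helper name.

```python
def estimate_checkpoint_gb(num_params, bytes_per_param=2):
    """Rough on-disk size in GB (10^9 bytes) of a dense checkpoint.

    bytes_per_param=2 assumes 16-bit weights; this is an assumption,
    not something confirmed in the thread.
    """
    return num_params * bytes_per_param / 1e9


# 974.36M total parameters, as reported in the log above
small = estimate_checkpoint_gb(974.36e6)
# a nominal 7B-parameter model
full = estimate_checkpoint_gb(7e9)
print(f"{small:.2f} GB vs {full:.2f} GB")
```

Under that assumption, 974.36M parameters come to roughly 1.95 GB, close to the 1.9 GB pytorch_model.bin, while 7B parameters come to about 14 GB. A small checkpoint is therefore consistent with a truncated model rather than with LoRA adapters.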

qqqq-0 commented 9 months ago

Thanks, I found the problem: num_hidden_layers in pretrained_config was set incorrectly. Judging from the printed log, this training run did not use LoRA, right?
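A quick sanity check on the layer count before training could look like the sketch below. It assumes the pretrained model directory contains a Hugging Face style JSON config with a `num_hidden_layers` key; the function names and the file location are hypothetical, and the expected layer count is something the caller must supply.

```python
import json


def read_num_hidden_layers(config_path):
    """Read num_hidden_layers from a JSON model config (None if the key is absent)."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("num_hidden_layers")


def check_layer_count(config_path, expected_layers):
    """Return "ok" or a short message describing the mismatch."""
    actual = read_num_hidden_layers(config_path)
    if actual != expected_layers:
        return f"num_hidden_layers is {actual}, expected {expected_layers}"
    return "ok"
```

For example, `check_layer_count("path/to/config.json", 32)` would flag a config that was left at 2 layers, assuming 32 is the right depth for the model in question (it is for LLaMA-7B style models).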

RupertLuo commented 9 months ago

Right. If LoRA were in use, the names of the trainable parameters would contain the string "lora".
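That rule of thumb is easy to automate. The sketch below filters a list of trainable parameter names for the substring "lora"; `lora_param_names` is a hypothetical helper, and the adapter-style name in the usage note is an illustrative example, not taken from this thread.

```python
def lora_param_names(param_names):
    """Return the parameter names that contain "lora" (case-insensitive)."""
    return [name for name in param_names if "lora" in name.lower()]


# Names taken from the table printed earlier in this thread
logged_names = [
    "model.embed_tokens.weight",
    "model.layers.*.self_attn.q_proj.weight",
    "model.mm_projector.weight",
]
print(lora_param_names(logged_names))
```

An empty result, as for the names logged here, suggests full fine-tuning. A name such as `model.layers.0.self_attn.q_proj.lora_A.weight` (illustrative) would be returned and indicate that adapters are being trained.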

qqqq-0 commented 9 months ago

OK, thanks. If I use a checkpoint from fine-tuning without LoRA, I can run inference just by passing the checkpoint path as the model path, right?

RupertLuo commented 9 months ago

Yes, exactly. If you want to be safe, set the save step to a smaller value and check whether what gets saved looks right.
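One way to inspect what an early save produced is to list the files in the output directory with their sizes. This is a minimal sketch using only the standard library; `list_checkpoint_files` is a hypothetical helper, not part of the repository.

```python
import os


def list_checkpoint_files(output_dir):
    """Return (relative_path, size_in_bytes) pairs, largest file first."""
    entries = []
    for root, _dirs, files in os.walk(output_dir):
        for filename in files:
            path = os.path.join(root, filename)
            rel = os.path.relpath(path, output_dir)
            entries.append((rel, os.path.getsize(path)))
    # Sort by size so the weight files appear at the top
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def print_checkpoint_files(output_dir):
    """Print one line per file with its size in MB."""
    for rel, size in list_checkpoint_files(output_dir):
        print(f"{size / 1e6:10.2f} MB  {rel}")
```

If the largest file is far smaller than the expected size of the full model, the checkpoint probably does not contain all the weights.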

qqqq-0 commented 9 months ago

OK, thanks.