Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Cannot replicate the fine-tuning results on llava_instruct_150k. #28

Closed liminghao0914 closed 1 year ago

liminghao0914 commented 1 year ago

I've tried both llama-adapter and llama2-accessory, but I failed to replicate the multi-modal fine-tuning results. For llama-adapter, I used BIAS-7B.pth as the pretrained checkpoint. For llama2-accessory, since there's no adapter checkpoint released, I trained it myself. Eventually, I got the following results.

For llama-adapter, the output looks like this:

The cat is positioned in the image.

### Instruction:
What is the cat doing?

### Response:
The cat is doing.

### Instruction:
What is the cat doing?

### Response:
The cat is doing.

[...the same "### Instruction:" / "### Response:" pair repeats several more times, ending truncated at "What is the cat"]

For llama2-accessory, the result is even stranger:

ámámám inequalityParameterзоваdropdownзовазовазова... [the token "зова" repeats until the output is truncated]

I did change the code to work around some compatibility issues, since I can't use flash-attention or apex. I am sure the EOS token is appended at the end.

Did I use the wrong hyperparameters? The corresponding shell scripts are attached below.

For llama-adapter,

python -u -m torch.distributed.launch --master_port=1113 --nproc_per_node=4 --use_env \
 main_finetune.py --data_config "$CONFIG" --batch_size 4 \
 --epochs 4 --lr 0.00003 --weight_decay 0.02 \
 --llama_path "$LLAMA_PATH" \
 --output_dir "$OUTPUT_DIR" \
 --pretrained_path "$PRETRAINED_PATH"

For llama2-accessory,

torchrun --master_port=1112 --nproc_per_node=4 main_finetune.py \
--output_dir output/"$exp_name" --epochs 4 --warmup_epochs 0.2 \
--batch_size 4 --accum_iter 2 --num_workers 4 \
--max_words 512 --precision fp16 \
--lr 0.00003 --min_lr 0.000005 --clip_grad 2 --weight_decay 0.02 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" --checkpointing \
--llama_type llama_adapter --llama_config "$llama_config" --tokenizer_path "$tokenizer_path" \
--pretrained_path "$pretrained_path" --pretrained_type="$pretrained_type" \
--data_config "$data_config" \
2>&1 | tee -a output/"$exp_name"/output.log

Appreciate any suggestions and advice. Thanks! 🙏

ChrisLiu6 commented 1 year ago

Hi, thank you for your interest in our work. The following are my suggestions:

For LLaMA-Adapter:

  1. Can you get the correct output using our provided checkpoint (i.e. BIAS-7B.pth) and commands? Note that the --llama_path argument should always point to the LLaMA-V1 weights, with which the BIAS-7B.pth checkpoint was trained.
  2. As we mention here, there is an extra training stage on large-scale image-text-paired datasets before fine-tuning. To replicate our result, you should fine-tune from the outcome of that stage (we will upload the checkpoint soon and notify you), not from the outcome of the final fine-tuning stage (i.e. BIAS-7B.pth). That said, fine-tuning on top of BIAS-7B.pth should still give reasonable results, so I suspect there are other problems.

For LLaMA2-Accessory: The script you provided is correct (though we recommend bf16 instead of fp16). However, it is meant to fine-tune a model that has already been pre-trained on large-scale image-text pairs. See here: the clip_proj and visual_proj parameters are not trainable by default. The code for image-text-pair pre-training will be released soon.
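If it helps, here is a minimal sketch (my own, not from the repo) for listing which parameters will actually receive gradients before launching fine-tuning; it assumes `model` is the torch.nn.Module built by the fine-tuning script:

import torch

# Generic check: print every parameter that requires gradients and count the rest.
def list_trainable(model: torch.nn.Module) -> None:
    n_trainable, n_frozen = 0, 0
    for name, param in model.named_parameters():
        if param.requires_grad:
            n_trainable += param.numel()
            print("trainable:", name, tuple(param.shape))
        else:
            n_frozen += param.numel()
    print(f"{n_trainable:,} trainable / {n_frozen:,} frozen parameters")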

However, your results show that your model just outputs random characters, which indicates problems beyond those mentioned above. Please check these two key points that are easy to miss:

  1. Our fine-tuning script only saves the trainable parameters. This means that if you simply load the saved checkpoint before inference, most parameters will still be randomly initialized. You therefore need to merge the saved fine-tuned parameters into the original LLaMA2 parameters before inference (see the sketch after this list).

  2. Could you share the command you use for inference? I think this should work for the 7B multi-modal LLaMA2 adapter:

      torchrun --nproc-per-node=1  demos/single_turn_mm.py \
      --llama_config /path/to/params.json --tokenizer_path /path/to/tokenizer.model \
      --llama_type llama_adapter \
      --pretrained_path /path/to/llama_adapter_merged

    Just make sure that you did not forget to set llama_type to llama_adapter. Again, pretrained_path should point to the merged checkpoint.
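For reference, a minimal merge sketch, assuming both the fine-tuned output and the original LLaMA2 weights are plain PyTorch state dicts loadable with torch.load (the file paths below are hypothetical, and the repo may ship its own merge utility):

import torch

# Hypothetical paths; point these at your own base weights and fine-tuned output.
base = torch.load("llama2-7b/consolidated.00.pth", map_location="cpu")
delta = torch.load("output/finetune/consolidated.00.model.pth", map_location="cpu")

# The fine-tuned checkpoint contains only the trainable subset of parameters,
# so overwrite the matching entries in the base state dict and keep the rest.
base.update(delta)

torch.save(base, "llama_adapter_merged/consolidated.00.pth")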

We will update our code and docs to better support the PEFT + multi-modal setting very soon, so please stay tuned. If you are interested in multi-modal models, we recommend this one, which generally performs better.

liminghao0914 commented 1 year ago

Thanks for your kind help. I found that the bug above was caused by distributed training mode, which automatically prepends module. to each layer name in the saved checkpoint. Looking forward to the PEFT + multi-modal setting and your future work.
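For anyone hitting the same issue, a minimal sketch of stripping the DistributedDataParallel module. prefix from a saved checkpoint (assuming the checkpoint is a plain state dict; the file paths are hypothetical):

import torch

# Hypothetical path to a checkpoint saved from a DDP-wrapped model.
state = torch.load("checkpoint.pth", map_location="cpu")

# DistributedDataParallel wraps the model and prefixes every parameter name
# with "module."; strip it so the weights load into the unwrapped model.
prefix = "module."
state = {k[len(prefix):] if k.startswith(prefix) else k: v for k, v in state.items()}

torch.save(state, "checkpoint_no_prefix.pth")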