Yxxxb / VoCo-LLaMA

VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
https://yxxxb.github.io/VoCo-LLaMA-page/
Apache License 2.0

Training Objective #4

Closed by gordonhu608 3 months ago

gordonhu608 commented 3 months ago

Congratulations on a great work! I haven't gone over every line of code, but could you please point out where you implemented the KL divergence training objective for your model? Thank you!

Yxxxb commented 3 months ago

Hi,

Thank you for your interest in our work! The KL divergence training objective and compressed-model distillation are the ideas behind our construction of VoCo-LLaMA. In Equation 4 of the paper, the goal is to make the output distribution of VoCo-LLaMA approximate the output distribution of the original model $VLM_o$ (in the paper, we use LLaVA as an example). In the concrete implementation, we realize this training paradigm by inserting VoCo tokens and modifying the attention mask (sketched below), so the model only needs to be trained under the standard visual instruction tuning stage. The final loss and training objective are identical to those of LLaVA (visual instruction tuning).
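For intuition, here is a minimal, hypothetical sketch of that attention-mask trick. It is not the repo's actual code: the function name `build_voco_attention_mask`, the token ordering `[vision | VoCo | text]`, and the boolean convention (`True` = may attend) are all assumptions for illustration. The point is that text tokens are blocked from attending to the raw vision tokens, so all visual information must flow to them through the VoCo tokens.

```python
import torch

def build_voco_attention_mask(num_vision: int, num_voco: int, num_text: int) -> torch.Tensor:
    """Hypothetical sketch of the VoCo-style attention mask.

    Assumes the sequence is ordered [vision | VoCo | text].
    Returns a (total, total) boolean mask where True means
    "query position may attend to key position".
    """
    total = num_vision + num_voco + num_text
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    # Block text tokens from attending directly to vision tokens,
    # forcing visual information to pass through the VoCo tokens.
    text_start = num_vision + num_voco
    mask[text_start:, :num_vision] = False
    return mask

# Example: 4 vision tokens compressed into 1 VoCo token, 3 text tokens.
print(build_voco_attention_mask(4, 1, 3).int())
```

Under this masking, training with the ordinary next-token cross-entropy loss already pushes the compressed path (text attending only to VoCo tokens) to reproduce what the uncompressed model would output, which is the distillation idea behind Equation 4.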

By the way, the Matryoshka Query Transformer you proposed has some interesting ideas as well.

Best regards,

Xubing