Beckschen / LLaVolta

Efficient Multi-modal Models via Stage-wise Visual Context Compression
Apache License 2.0

Training loss curves #2

Closed by lzhxmu 4 days ago

lzhxmu commented 3 weeks ago

Hi! Thanks for your inspiring work!

As you mentioned in the main paper, "The simple pooling operation makes training stable." Could you provide a comparison of training losses for different Visual Context Compressors (Pooling and Pruning) to support this conclusion?

Such a comparison would make LLaVolta's claim more solid.

Beckschen commented 3 weeks ago

Hi,

Thank you for your interest! We compared the training results of different visual context compressors in Table 3. Regarding the statement "The simple pooling operation makes training stable," we want to clarify that while advanced compressors (e.g., attention-based token pruning) excel in inference-only scenarios, the simple pooling method performs better during training. We hypothesize that this is because training advanced compressors, such as attention-based pruning, requires (1) differentiable token selection and (2) stable attention mechanisms in the LM Transformer.

We will push a minor update of the code and checkpoints in the near future and would be happy to release the training curves as requested. 😊
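To illustrate the distinction, here is a minimal PyTorch sketch, not our released implementation; the function names, shapes, and the pruning heuristic are illustrative. The pooling compressor is parameter-free and fully differentiable, while the pruning compressor relies on a hard top-k selection over attention-derived scores, which is the non-differentiable step mentioned above.

```python
# Illustrative sketch only; not the LLaVolta codebase.
import torch
import torch.nn.functional as F

def pooling_compressor(visual_tokens: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Compress visual tokens with 1D average pooling along the sequence.

    visual_tokens: (batch, num_tokens, hidden_dim)
    No learned parameters and fully differentiable, which is one reason
    training with it tends to be stable.
    """
    x = visual_tokens.transpose(1, 2)                  # (B, D, N) for pooling
    x = F.avg_pool1d(x, kernel_size=stride, stride=stride)
    return x.transpose(1, 2)                           # (B, N // stride, D)

def attention_pruning_compressor(visual_tokens: torch.Tensor,
                                 attn_scores: torch.Tensor,
                                 keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-k visual tokens ranked by an attention-derived score.

    attn_scores: (batch, num_tokens), e.g. attention a text query pays to
    each visual token. The top-k selection is a hard, non-differentiable
    choice, which is the training difficulty noted above.
    """
    k = max(1, int(visual_tokens.size(1) * keep_ratio))
    idx = attn_scores.topk(k, dim=1).indices                       # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1)) # (B, k, D)
    return visual_tokens.gather(1, idx)                            # (B, k, D)

# Example: 576 visual tokens (24x24 patches), hidden size 4096.
tokens = torch.randn(2, 576, 4096)
scores = torch.rand(2, 576)
print(pooling_compressor(tokens).shape)                    # (2, 288, 4096)
print(attention_pruning_compressor(tokens, scores).shape)  # (2, 288, 4096)
```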

Best,
Jieneng