Beckschen / LLaVolta

Efficient Multi-modal Models via Stage-wise Visual Context Compression
Apache License 2.0

Training loss curves #2

Closed by lzhxmu 4 days ago

lzhxmu commented 3 weeks ago

Hi! Thanks for your inspiring work!

As you mentioned in the main paper, "The simple pooling operation makes training stable." Could you provide a comparison of training losses for different Visual Context Compressors (Pooling and Pruning) to support this conclusion?

Such a comparison would make LLaVolta's claim more solid.

Beckschen commented 3 weeks ago

Hi,

Thank you for your interest! We compared the training results of different visual context compressors in Table 3. Regarding the statement "The simple pooling operation makes training stable," we want to clarify that while advanced compressors (e.g., attention-based token pruning) excel in inference-only scenarios, the simple pooling method performs better during training. We hypothesize that this is because training advanced compressors, such as attention-based pruning, requires (1) differentiable token selection and (2) stable attention mechanisms in the LM Transformer.

We will push a minor update of the code and checkpoints in the near future and would be happy to release the training curves as requested. 😊
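To illustrate the distinction, here is a minimal PyTorch sketch, not our released implementation; the function names, shapes, and the pruning heuristic are illustrative. The pooling compressor is parameter-free and fully differentiable, while the pruning compressor relies on a hard top-k selection over attention-derived scores, which is the non-differentiable step mentioned above.

```python
# Illustrative sketch only; not the LLaVolta codebase.
import torch
import torch.nn.functional as F

def pooling_compressor(visual_tokens: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Compress visual tokens with 1D average pooling along the sequence.

    visual_tokens: (batch, num_tokens, hidden_dim)
    No learned parameters and fully differentiable, which is one reason
    training with it tends to be stable.
    """
    x = visual_tokens.transpose(1, 2)                  # (B, D, N) for pooling
    x = F.avg_pool1d(x, kernel_size=stride, stride=stride)
    return x.transpose(1, 2)                           # (B, N // stride, D)

def attention_pruning_compressor(visual_tokens: torch.Tensor,
                                 attn_scores: torch.Tensor,
                                 keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-k visual tokens ranked by an attention-derived score.

    attn_scores: (batch, num_tokens), e.g. attention a text query pays to
    each visual token. The top-k selection is a hard, non-differentiable
    choice, which is the training difficulty noted above.
    """
    k = max(1, int(visual_tokens.size(1) * keep_ratio))
    idx = attn_scores.topk(k, dim=1).indices                       # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1)) # (B, k, D)
    return visual_tokens.gather(1, idx)                            # (B, k, D)

# Example: 576 visual tokens (24x24 patches), hidden size 4096.
tokens = torch.randn(2, 576, 4096)
scores = torch.rand(2, 576)
print(pooling_compressor(tokens).shape)                    # (2, 288, 4096)
print(attention_pruning_compressor(tokens, scores).shape)  # (2, 288, 4096)
```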

Best,
Jieneng