Open zwcolin opened 2 months ago
Hello,

This is great work and I enjoyed reading the paper! It considerably improved my understanding of VLM training :).

In the paper, you mention using SFT data for pre-alignment, which I assume is the same amount of data used for SFT in the full model (i.e., stage 3). If that is correct, I'm curious whether you have run any comparisons where the pre-alignment stage is removed and its data is instead used as additional data in stage 2 (i.e., more data for 1 epoch) or in stage 3 (i.e., 2 epochs, assuming stages 1 and 3 use the same data).

Does the pre-alignment stage show an advantage in these comparisons? I'm particularly interested in the setting where the pre-alignment stage is dropped and an equivalent amount of training time is added to stage 3: another paper has shown that scaling training time (with the same amount of data) can also improve performance. I'm curious about your thoughts on this.
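To make the comparison I have in mind concrete, here is a rough sketch of the three recipes; the stage names, data mixtures, and epoch counts are just illustrative placeholders I made up, not details from the paper:

```python
# Purely illustrative sketch of the three training recipes being compared.
# Stage names, data mixtures, and epoch counts are assumptions for clarity,
# not numbers taken from the paper.

from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    data: list[str]  # which data mixtures the stage trains on
    epochs: int = 1


@dataclass
class Recipe:
    name: str
    stages: list[Stage] = field(default_factory=list)

    def describe(self) -> None:
        print(f"--- {self.name} ---")
        for s in self.stages:
            print(f"  {s.name}: data={s.data}, epochs={s.epochs}")


# Setting 0: the paper's recipe as I understand it (pre-align reuses the SFT data).
baseline = Recipe("pre-align + pretrain + SFT", [
    Stage("stage1_prealign", ["sft_data"]),
    Stage("stage2_pretrain", ["pretrain_data"]),
    Stage("stage3_sft", ["sft_data"]),
])

# Setting 1: drop pre-align, fold its data budget into stage 2 (more data, 1 epoch).
more_stage2_data = Recipe("no pre-align, extra data in stage 2", [
    Stage("stage2_pretrain", ["pretrain_data", "sft_data"]),
    Stage("stage3_sft", ["sft_data"]),
])

# Setting 2: drop pre-align, spend the equivalent compute as a second SFT epoch.
longer_stage3 = Recipe("no pre-align, stage 3 for 2 epochs", [
    Stage("stage2_pretrain", ["pretrain_data"]),
    Stage("stage3_sft", ["sft_data"], epochs=2),
])

if __name__ == "__main__":
    for recipe in (baseline, more_stage2_data, longer_stage3):
        recipe.describe()
```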
Thanks!

Hi @zwcolin, thanks a lot for the great suggestion! Yes, we used the same amount of SFT data in the pre-alignment stage as in the full model. Your suggestion makes a lot of sense; we do intend to run a more comprehensive comparison, including the settings you suggested, and will keep you posted.