dhg-wei / MCL

(ICML 2024) Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning

When will you share the code and MMC dataset? It's August now~~ #3

Closed SVT-Yang closed 6 days ago

dhg-wei commented 4 months ago

Apologies for the delay. I'm currently busy with another project. I'll organize and release the code later. I expect to have everything ready in about two weeks.

Thank you for your understanding!

SVT-Yang commented 4 months ago

Thanks for your response. I have two other questions about details of the paper.

  1. L_cap and L_ret objectives: Could you please elaborate on how the L_cap and L_ret objectives are trained? The inputs for the L_cap and L_ret objectives seem to differ from the inputs shown in Figure 2.
  2. Stacking retrieval mechanism: You mention stacking 5 [RET] tokens. Are the dimensions of the stacked tokens consistent with those of V and T? I am concerned there might be a dimension mismatch. Could you provide more details on how this is addressed?

Thank you very much for your time and assistance.

dhg-wei commented 3 months ago
  1. L_cap and L_ret are trained with separate forward passes (see the first sketch after this list).

  2. The stacking mechanism is implemented by modifying the attention masks and the position indices of the [RET] tokens; nothing else changes (see the second sketch after this list).
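
Since "separate forward passes" is terse, here is a minimal sketch of how the two objectives could be combined in one training step. The specific loss forms (next-token cross-entropy for L_cap, a symmetric InfoNCE for L_ret), the tensor shapes, and all names below are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def caption_loss(lm_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Assumed form of L_cap: next-token cross-entropy over caption tokens.

    lm_logits: (B, L, vocab) from the captioning forward pass.
    labels:    (B, L) with -100 on positions that should not contribute.
    """
    shifted_logits = lm_logits[:, :-1].reshape(-1, lm_logits.size(-1))
    shifted_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shifted_logits, shifted_labels, ignore_index=-100)

def retrieval_loss(ret_emb: torch.Tensor, img_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Assumed form of L_ret: symmetric InfoNCE between [RET] states and image features.

    ret_emb: (B, d) hidden states taken at the [RET] positions of the
             retrieval forward pass.
    img_emb: (B, d) visual features of the paired images.
    """
    ret_emb = F.normalize(ret_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = ret_emb @ img_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Conceptually: run the model once on the captioning-style input to get
# lm_logits for L_cap, run it again on the retrieval-style input (ending in
# [RET]) to get ret_emb for L_ret, then back-propagate loss_cap + loss_ret.
```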
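For point 2, here is one way the attention-mask and position-index changes could resolve the dimension concern: every stacked [RET] token reuses the same position index and is masked off from the other [RET] tokens, so each produces an ordinary hidden state of dimension d and there is no mismatch with V and T. This parallel-query interpretation, and every name in the sketch, are assumptions rather than the released code.

```python
import torch

def stacked_ret_inputs(prefix_len: int, num_ret: int = 5):
    """Hypothetical sketch of "stacking" [RET] tokens via position ids + attention mask.

    Assumption (not taken from the released code): the num_ret [RET] tokens are
    appended after a prefix of prefix_len tokens, they all reuse the same
    position index, and each one attends to the prefix and itself but not to
    the other [RET] tokens, so they act as parallel retrieval queries of the
    usual hidden dimension.
    """
    total = prefix_len + num_ret

    # Prefix keeps ordinary positions 0..prefix_len-1; every [RET] reuses prefix_len.
    position_ids = torch.cat([
        torch.arange(prefix_len, dtype=torch.long),
        torch.full((num_ret,), prefix_len, dtype=torch.long),
    ])

    # Boolean attention mask, True = may attend. Causal over the prefix.
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool))

    # Each [RET] row sees the whole prefix plus itself, but no other [RET] token.
    for i in range(num_ret):
        row = prefix_len + i
        mask[row, :prefix_len] = True
        mask[row, row] = True

    return position_ids, mask
```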

SVT-Yang commented 3 months ago

Got it. Thanks~

SVT-Yang commented 3 months ago

Come on, it's almost September!!