baaivision / Emu

Emu Series: Generative Multimodal Models from BAAI
https://baaivision.github.io/emu2/
Apache License 2.0

Question about training #35

Open HongyanZhi opened 1 year ago

HongyanZhi commented 1 year ago

Thanks for your great work first! I found that the code uses "emu_encoder.decoder.lm.generate()" to produce the text response and "emu_encoder.decoder.lm.model()" to produce the latent image embeddings. How can I output both the text and the image embeddings to reproduce your training process? Or does training first use "emu_encoder.decoder.lm.generate()" to generate the text, and then "emu_encoder.decoder.lm.model()" to generate the image embeddings? Thanks for your reply!

yqy2001 commented 1 year ago

Hello. Thanks for your interest in our work. For each training example, we generate the embeddings only once. Note that for the text loss we also first generate the embeddings, then compute the classification (cross-entropy) loss. The image loss is computed in the same place, but with a regression objective instead of a classification one.
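In case it helps later readers, here is a rough sketch of that single-pass setup: one forward through the LM yields both logits (for the text cross-entropy loss) and hidden states (regressed to target image embeddings). All names here are hypothetical and the shapes are simplified; this is not the actual Emu training code.

```python
import torch
import torch.nn.functional as F

def compute_losses(logits, hidden_states, text_labels, image_targets, image_mask):
    """Compute text CE loss and image regression loss from ONE forward pass.

    logits:        (batch, seq, vocab)  LM output logits
    hidden_states: (batch, seq, dim)    LM output hidden states
    text_labels:   (batch, seq)         token ids; -100 at image positions
    image_targets: (n_img_tokens, dim)  target latent image embeddings
    image_mask:    (batch, seq) bool    True at image-embedding positions
    """
    # Text loss: standard next-token classification (cross-entropy);
    # image positions are masked out via ignore_index.
    text_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )
    # Image loss: regress the hidden states at image positions onto the
    # target embeddings (MSE), instead of classifying over a vocabulary.
    image_loss = F.mse_loss(hidden_states[image_mask], image_targets)
    return text_loss + image_loss
```

Both losses come from the same forward pass, so no separate `generate()` call is needed during training.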

Hoyyyaard commented 11 months ago

Thanks for your reply!
I have two more questions:

  1. https://github.com/baaivision/Emu/blob/9671c371105f151eee60c48ac6738407238bd20c/models/pipeline.py#L115C29-L115C29 . If classifier-free guidance is used, shouldn't noise_pred_uncond come from a forward pass without encoder_hidden_states? That is not what the current code does.
  2. https://github.com/baaivision/Emu/blob/9671c371105f151eee60c48ac6738407238bd20c/models/modeling_llama.py#L234 Why does labels have 33 tokens instead of the 32 tokens the paper says?
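
For reference, the classifier-free guidance update I had in mind for question 1 looks roughly like the following. This is a generic sketch (names like `cfg_step` and the `unet` signature are my own, not the Emu pipeline code): the noise is predicted twice, once with null/unconditional embeddings and once with the real conditioning, and the two are combined.

```python
import torch

def cfg_step(unet, latents, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """One classifier-free-guidance denoising step (illustrative only)."""
    # Unconditional branch: null (or empty) conditioning embeddings.
    noise_uncond = unet(latents, t, encoder_hidden_states=uncond_emb)
    # Conditional branch: the actual multimodal conditioning embeddings.
    noise_cond = unet(latents, t, encoder_hidden_states=cond_emb)
    # Guided prediction: move away from the unconditional direction.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```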

Many thanks for your reply!