AILab-CVC / SEED-X

Multimodal Models in Real World

Weird data operations. #15

Closed idealwei closed 4 weeks ago

idealwei commented 2 months ago

After reading your code, I see some weird data operations, which are very confusing.

1. When using LAION and CapsFusion for image-caption alignment, I didn't see any prompt like "Describe this image" in "image_text_paris_clm.py".

2. When using LAION and CapsFusion for image generation, "gen_prompt", "gen_response", and "image_caption" are concatenated as input to the LLM. However, you don't mask the "gen_prompt" part when you calculate "lm_loss".

Please clarify these operations for me. Thank you.

geyuying commented 1 month ago
  1. We do not use prompts like "Describe this image" for image-caption aligning during pre-training.

  2. There was an issue with the previous version of the code, and we've already fixed this bug in the updated code (https://github.com/AILab-CVC/SEED-X/commit/cce5d32d0ae1a6a958c047a56fd7003ae6ed474f).
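
As a rough illustration of the masking discussed in point 2, here is a minimal sketch. It is not the code from the commit linked above: the function name `build_labels`, the token ids, and the split into a prompt part and a target part are all assumptions made for illustration. The idea is that label positions belonging to the prompt are set to `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so they do not contribute to the language-modeling loss.

```python
# Default ignore_index of torch.nn.CrossEntropyLoss; positions carrying
# this label are skipped when the loss is averaged.
IGNORE_INDEX = -100


def build_labels(prompt_ids, target_ids, ignore_index=IGNORE_INDEX):
    """Concatenate prompt and target token ids and mask the prompt in the labels.

    Hypothetical helper for illustration; not taken from the SEED-X code.
    """
    input_ids = list(prompt_ids) + list(target_ids)
    # The prompt tokens are still fed to the model as context,
    # but they are excluded from the loss.
    labels = [ignore_index] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


# Made-up token ids standing in for a tokenized prompt and its target.
prompt_ids = [11, 12, 13]
target_ids = [21, 22]
input_ids, labels = build_labels(prompt_ids, target_ids)
print(input_ids)
print(labels)
```

With labels built this way, only the target positions are penalized, which is the behavior the original question expected for the "gen_prompt" part.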