Yangyi-Chen / SOLO

Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"
Apache License 2.0

Error link in PRETRAIN_GUIDE.md #7

Open HYZ17 opened 3 months ago

HYZ17 commented 3 months ago

Thank you for your nice work.

I tried to read the pretraining doc and found some errors. PRETRAIN_GUIDE.md says that the script for data conversion is in scripts/data/megatron_conversion, but that folder does not exist. The script for model conversion that it mentions, scripts/model/create_mmistral.py, does not exist either.

HYZ17 commented 3 months ago

I am also interested in reproducing the pretraining stage, but the processing pipelines for some of the data sources in the paper seem to be missing. I would like to know what I should do to get that part of the data. Thank you.

Yangyi-Chen commented 3 months ago

Hi Yuzhen,

Thanks for your interest and sorry about the confusion.

  1. We have modified the structure a little bit. Could you recheck the folder for data conversion: scripts/data/megatron_conversion?
  2. The script for model conversion is
    python3 scripts/model/create_solo.py
  3. For the processing pipelines for some data sources, could you elaborate on which part you are looking for? Thanks!

HYZ17 commented 3 months ago

Thank you for your reply. I will check the data sources later.

However, I found that the code for SOLO on Hugging Face here does not support batch input, which is a very important feature for fast training and inference. May I ask why that is? Also, the code in this folder is outdated and triggers an error during inference.

HYZ17 commented 3 months ago

Moreover, there is a bug in your Hugging Face model code here. During inference, because there is no effective condition check, the image is re-embedded every time a new token is generated, which is unnecessary. Although this does not cause an error, it slows down generation.

For example, in LLaVA, when the input is just one token (a decoding step), only the embedding of that text input is computed.
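
To make the point concrete, here is a minimal toy sketch of the kind of condition check I mean. It is not the SOLO or LLaVA code: `ToyVLM`, `embed_image`, `embed_text`, `forward`, and `generate` are hypothetical names, and a plain Python list stands in for the KV cache. It only shows that skipping the image embedding on single-token decoding steps leaves the output unchanged while encoding the image once instead of once per generated token.

```python
# Toy illustration only. All names here are hypothetical and do not
# correspond to the actual SOLO or LLaVA implementation.

class ToyVLM:
    def __init__(self):
        self.image_embed_calls = 0  # counts how often the image is embedded

    def embed_image(self, image):
        self.image_embed_calls += 1
        return [float(p) for p in image]

    def embed_text(self, input_ids):
        return [float(t) for t in input_ids]

    def forward(self, input_ids, image=None, past=None, skip_when_cached=True):
        # A decoding step: a cache already exists and one new token arrives.
        is_decode_step = past is not None and len(input_ids) == 1

        # The condition check: do not re-embed the image on decoding steps.
        if image is not None and not (skip_when_cached and is_decode_step):
            image_embeds = self.embed_image(image)
        else:
            image_embeds = []

        text_embeds = self.embed_text(input_ids)
        if is_decode_step:
            # Only the new token extends the cache; image embeddings
            # computed on this step (if any) are simply wasted work.
            new_entries = text_embeds
        else:
            new_entries = image_embeds + text_embeds

        past = (past or []) + new_entries   # stand-in for a KV cache
        next_token = int(sum(past)) % 7     # dummy "prediction"
        return next_token, past


def generate(model, prompt_ids, image, max_new_tokens, skip_when_cached):
    # Prefill: the full prompt and the image are processed once.
    token, past = model.forward(prompt_ids, image, None, skip_when_cached)
    output = [token]
    # Decode: one token per step, reusing the cache.
    for _ in range(max_new_tokens - 1):
        token, past = model.forward([token], image, past, skip_when_cached)
        output.append(token)
    return output


with_check = ToyVLM()
out_a = generate(with_check, [1, 2, 3], [0.5, 0.25], 5, skip_when_cached=True)

without_check = ToyVLM()
out_b = generate(without_check, [1, 2, 3], [0.5, 0.25], 5, skip_when_cached=False)

print(with_check.image_embed_calls, without_check.image_embed_calls)
print(out_a == out_b)
```

In a real model the check would presumably look at the sequence length of `input_ids` and whether a past key/value cache is present, but the exact condition depends on how the model code handles generation.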