HYZ17 opened this issue 3 months ago
I am interested in reproducing the pretraining stage, but the processing pipelines for some of the data sources described in the paper seem to be missing. I would like to know what I should do to obtain that part of the data. Thank you.
Hi Yuzhen,
Thanks for your interest and sorry about the confusion.
`python3 scripts/model/create_solo.py`
Thank you for your reply. I will check the data sources later.
However, I found that the code for SOLO on Hugging Face here does not support batched input, which is an important feature for fast training and inference. May I ask why that is? Also, the code in this folder is outdated and triggers errors during inference.
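To illustrate what batched input would involve, here is a minimal sketch of left-padding variable-length prompts for batched decoding. The names (`PAD_ID`, `left_pad`) are illustrative assumptions, not part of the SOLO code.

```python
# Illustrative only: left-pad token-id sequences of different lengths
# to a common width so one forward pass can serve the whole batch.
# Left padding keeps the most recent tokens aligned at the right edge,
# which is the usual convention for decoder-only generation.

PAD_ID = 0  # assumed pad token id

def left_pad(batch):
    """Left-pad each sequence and build a matching attention mask."""
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        pad = [PAD_ID] * (width - len(seq))
        input_ids.append(pad + seq)          # pad tokens on the left
        attention_mask.append([0] * len(pad) + [1] * len(seq))
    return input_ids, attention_mask

# Example: two prompts of lengths 3 and 1 become a 2x3 batch.
ids, mask = left_pad([[1, 2, 3], [7]])
```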
Moreover, there is a bug in your Hugging Face model code here. During inference, due to a missing condition check, the image gets re-embedded every time a new token is generated, which is unnecessary. Although this does not cause an error, it slows down generation.
For example, in LLaVA, when the incoming input is just one new token, only that token's embedding is computed; the image is not re-embedded.
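The fix described above amounts to a guard on the KV cache. Below is a minimal sketch of that idea; the function name and signature are hypothetical and not the actual SOLO or LLaVA API.

```python
# Hypothetical sketch: embed the image only on the first forward pass
# (no KV cache yet). During incremental decoding the image and prompt
# are already cached, so only the newly generated token is processed.

def prepare_inputs(input_ids, pixel_values=None, past_key_values=None):
    """Return the inputs that actually need embedding this step."""
    if past_key_values is not None:
        # Incremental step: keep only the last token per sequence and
        # drop the image so it is not re-embedded.
        return {"input_ids": [row[-1:] for row in input_ids],
                "pixel_values": None}
    # First pass: embed the full prompt together with the image.
    return {"input_ids": input_ids, "pixel_values": pixel_values}
```

With this guard, the costly vision-encoder call runs once per generation, not once per token.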
Thank you for your nice work.
I tried to read the pretraining doc and found some errors. In PRETRAIN_GUIDE.md, you mention that the script for data conversion is in scripts/data/megatron_conversion, but that folder does not exist. The script for model conversion, `scripts/model/create_mmistral.py`, mentioned here does not exist either.