Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Parquet Files - Pretraining #75

Closed gian-g3dai closed 10 months ago

gian-g3dai commented 11 months ago

Hi there! Thank you for sharing this great repo.

I am trying to pretrain a model using main_pretrain.py, which calls falcon.py. The only problem I see is that falcon.py reads just one column of the parquet file (the one called 'content').
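
For reference, loading just one column from a parquet shard looks roughly like this (a sketch using pandas, not the exact code in falcon.py):

import pandas as pd

# Illustrative only: read a single column ('content') from a parquet shard.
df = pd.read_parquet("path/to/shard.parquet", columns=["content"])
texts = df["content"].tolist()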

Is there a version of the script where images are read and encoded as well?

ChrisLiu6 commented 11 months ago

Hi, currently the dataset in falcon.py is for text-only pre-training. If you want to train on image-text pairs, you may follow the fine-tuning pipeline with the following data config:

META:
  -
    path: path/to/your/data.csv
    type: 'image_text'
    preprocess: 'caption'
    prompt_type: 'caption'
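
For illustration, a minimal sketch of writing such a CSV; the column names used here ('image', 'caption') are assumptions for the example, so check the image_text dataset code for the schema it actually expects:

import pandas as pd

# Hypothetical schema: one image path and one caption per row.
df = pd.DataFrame({
    "image": ["images/0001.jpg", "images/0002.jpg"],
    "caption": ["a dog playing in the snow", "two people riding bicycles"],
})
df.to_csv("path/to/your/data.csv", index=False)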

The preprocess parameter takes effect here:

https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/c7fd8f83d3564e0982c63e8e0a1c8930b30c6cfe/accessory/data/alpaca.py#L150

and the prompt_type parameter takes effect here:

https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/c7fd8f83d3564e0982c63e8e0a1c8930b30c6cfe/accessory/data/alpaca.py#L115

Other configurations should be similar to this experiment. Note that you also need to re-specify which model parameters are trainable for this stage: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/c7fd8f83d3564e0982c63e8e0a1c8930b30c6cfe/accessory/model/LLM/llama.py#L332
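
For illustration, a rough sketch of restricting which parameters are trainable in PyTorch; the function name and the module names it selects are assumptions for this example, not the repo's actual code:

import torch.nn as nn

def get_trainable_params(model: nn.Module) -> dict:
    # Sketch: enable gradients only for the modules you want updated in
    # this stage (the 'visual_proj' / 'norm' selection is hypothetical).
    trainable = {}
    for name, param in model.named_parameters():
        if "visual_proj" in name or "norm" in name:
            param.requires_grad = True
            trainable[name] = param
        else:
            param.requires_grad = False
    return trainable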