Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Parquet Files - Pretraining #75

Closed gian-g3dai closed 10 months ago

gian-g3dai commented 11 months ago

Hi there! Thank you for sharing this great repo.

I am trying to pretrain a model using main_pretrain.py, which calls falcon.py. The only problem I see is that falcon.py reads just one column of the parquet file (the one called 'content').
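
For reference, loading just one column from a parquet shard looks roughly like this (a sketch using pandas, not the exact code in falcon.py):

import pandas as pd

# Illustrative only: read a single column ('content') from a parquet shard.
df = pd.read_parquet("path/to/shard.parquet", columns=["content"])
texts = df["content"].tolist()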

Is there a version of the script where images are read and encoded as well?

ChrisLiu6 commented 11 months ago

Hi, currently the dataset in falcon.py is for text-only pre-training. If you want to train on image-text pairs, you may follow the fine-tuning pipeline with the following data config:

META:
  -
    path: path/to/your/data.csv
    type: 'image_text'
    preprocess: 'caption'
    prompt_type: 'caption'
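
For illustration, a minimal sketch of writing such a CSV; the column names used here ('image', 'caption') are assumptions for the example, so check the image_text dataset code for the schema it actually expects:

import pandas as pd

# Hypothetical schema: one image path and one caption per row.
df = pd.DataFrame({
    "image": ["images/0001.jpg", "images/0002.jpg"],
    "caption": ["a dog playing in the snow", "two people riding bicycles"],
})
df.to_csv("path/to/your/data.csv", index=False)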

The preprocess parameter takes effect here:

https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/c7fd8f83d3564e0982c63e8e0a1c8930b30c6cfe/accessory/data/alpaca.py#L150

and the prompt_type parameter takes effect here:

https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/c7fd8f83d3564e0982c63e8e0a1c8930b30c6cfe/accessory/data/alpaca.py#L115

Other configurations should be similar to this experiment. Note that you also need to re-specify which model parameters are trainable for this stage: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/c7fd8f83d3564e0982c63e8e0a1c8930b30c6cfe/accessory/model/LLM/llama.py#L332
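
For illustration, a rough sketch of restricting which parameters are trainable in PyTorch; the function name and the module names it selects are assumptions for this example, not the repo's actual code:

import torch.nn as nn

def get_trainable_params(model: nn.Module) -> dict:
    # Sketch: enable gradients only for the modules you want updated in
    # this stage (the 'visual_proj' / 'norm' selection is hypothetical).
    trainable = {}
    for name, param in model.named_parameters():
        if "visual_proj" in name or "norm" in name:
            param.requires_grad = True
            trainable[name] = param
        else:
            param.requires_grad = False
    return trainable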