Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Finetuning with raw text? #69

Closed wj210 closed 10 months ago

wj210 commented 11 months ago

Hi, I would like to ask if it's possible to do fine-tuning with just raw text (in a pre-training style) with PEFT? I have a large corpus of text in a specific domain, but I lack the ability to transform it into a labelled format, and I don't have the resources for pretraining.

Would the code support this?

ChrisLiu6 commented 11 months ago
  1. If your requirement is to PEFT-finetune an LLM on text data without any system prompt, you can use the following data config:

    META:
      -
        path: path/to/your/data.csv
        type: 'text'
        prompt_type: 'None'

    The data.csv file is expected to contain a column named instruction, whose contents are your raw text data (a sketch for preparing such a file follows this list). The relevant code logic is here: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/c7fd8f83d3564e0982c63e8e0a1c8930b30c6cfe/accessory/data/alpaca.py#L115

  2. On the other hand, if your data is large and needs the lazy-loading mechanism we support in the pre-training pipeline, you may start from our pre-training examples here, changing the --llama_type argument to llama_peft or llama_adapter (a sketch of such a command also follows below). If you want to start your training from an existing checkpoint, use the --pretrained_path argument.
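
For the first option, here is a minimal sketch of how raw text could be packed into the expected data.csv. This is an illustration, not part of the toolkit: the corpus path, output file name, and character-based chunking are my assumptions; only the single instruction column comes from the loader code linked above.

    # Hypothetical helper: pack raw .txt files into a data.csv with the
    # single 'instruction' column that the linked alpaca.py loader reads.
    # Paths and the chunk size are illustrative assumptions.
    import csv
    import glob

    CHUNK_CHARS = 2048  # assumed chunk length; tune to your max sequence length

    with open("data.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["instruction"])  # column name expected by the loader
        for path in sorted(glob.glob("corpus/*.txt")):
            with open(path, encoding="utf-8") as src:
                text = src.read()
            # split long documents into fixed-size character chunks so each
            # row stays a manageable length after tokenization
            for i in range(0, len(text), CHUNK_CHARS):
                writer.writerow([text[i:i + CHUNK_CHARS]])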
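
For the second option, a launch command might look roughly like the following. Treat this as a hedged sketch: only --llama_type and --pretrained_path are confirmed above; the entry point and the remaining flags should be copied from the repository's actual pre-training example scripts.

    # hedged sketch: start from a pre-training example script and swap in
    # the PEFT model type plus an existing checkpoint to resume from
    torchrun --nproc_per_node=8 accessory/main_pretrain.py \
        --llama_type llama_peft \
        --pretrained_path /path/to/existing/checkpoint \
        ...  # keep the data/tokenizer/optimizer flags from the example script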