facebookresearch / metaseq

Repo for external large-scale work
MIT License

Fine-tune OPT with my own dataset #105

Open xiaomaiaa opened 2 years ago

xiaomaiaa commented 2 years ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I would like to know how to fine-tune OPT on my own dataset. Thanks a lot!

Skyy93 commented 2 years ago

Because of the Hugging Face integration here: https://huggingface.co/docs/transformers/model_doc/opt

you should be able to train most of the OPT models (not the big ones) the same way you would train any other Hugging Face model.
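
For the smaller checkpoints that can look roughly like the sketch below, using the plain Trainer API (the train.txt file name and the hyperparameters are placeholders, adjust them to your data):

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# One training example per line in train.txt (placeholder file name).
dataset = load_dataset("text", data_files={"train": "train.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="opt-350m-finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=5e-5,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False => causal language modeling; the collator builds labels from the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()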

lorr1 commented 2 years ago

What if I wanted to fine-tune the 6B or 13B models? Hugging Face is not optimized for that, and without model parallelism I'm not sure they would fit on a single GPU (even a 40GB one); I've had to use model parallelism for the 6B GPT-Neo models. Do you guys have a starting point for fine-tuning in your more optimized code base? I see the train endpoint, which looks like it hooks into Megatron?
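
For reference, on the Hugging Face side the larger checkpoints can at least be loaded sharded across GPUs in half precision; a rough sketch using Accelerate's device_map (this only gets the weights loaded, it is not an optimized fine-tuning setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Requires `pip install accelerate`; device_map="auto" spreads the fp16 weights
# across the visible GPUs (and CPU RAM if needed).
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)

Actually training at that scale presumably still needs something like DeepSpeed ZeRO or Megatron-style model parallelism on top of this.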

Dod-o commented 2 years ago

same question

DeepTitan commented 1 year ago

Can anyone link me to a Google Colab or webpage showing how to do this? I am trying to use the Trainer to train opt-350m but am not having any luck.

bokovhu commented 1 year ago

If anyone else is wondering how to do the fine-tuning, I have put together some code based on various internet sources I came across. Unfortunately, I cannot link to references, as I had to hack this together from multiple things and only have these scripts now ...

I have had success with both the 125m and the 350m models using the same settings. Adjust the dataset, output directory, train steps, etc. as needed.

I suggest installing the following dependencies before going further:

pip install -q datasets accelerate loralib
pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

For fine-tuning, I used the following code:

import os
# Hack for my machine, and Windows
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# The "text" loader turns every line of train.txt / test.txt into one record with a "text" field.
dialogs_dataset = load_dataset("text", data_files={"train": "train.txt", "test": "test.txt"})

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Freeze the base model; only the LoRA adapters added below will be trained.
# Cast 1-D parameters (layer norms, biases) to fp32 for numerical stability.
for param in model.parameters():
    param.requires_grad = False
    if param.ndim == 1:
        param.data = param.data.to(torch.float32)

# Trade compute for memory, and make the inputs require grads so gradient
# checkpointing works together with the frozen base weights.
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# Compute the LM head output (and hence the loss) in fp32.
class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

# Apply LoRA adapters to the attention query/value projections; only these are trained.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

# Tokenize the raw text; labels are added later by the data collator.
train_dataset = dialogs_dataset["train"].map(lambda x: tokenizer(x["text"]), batched=True)

trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    args=TrainingArguments(
        output_dir="opt-125m-fine-tuned",
        per_device_train_batch_size=4,
        warmup_steps=100,
        max_steps=20000,
        save_steps=400,
        learning_rate=1e-4,
        logging_steps=100,
    ),
    # mlm=False => causal language modeling (labels are the input ids, shifted internally).
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
# use_cache is incompatible with gradient checkpointing, so turn it off for training.
model.config.use_cache = False
trainer.train()

# Saves only the LoRA adapter weights, not the full base model.
model.save_pretrained("opt-125m-fine-tuned/peft-model")

NOTE that this is LoRA, not "proper" full fine-tuning. When I tried to train the model further without LoRA, the loss just would not converge after several thousand steps, so I gave up on that. I have no idea how the performance compares (or whether it differs at all) to proper fine-tuning.
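
For what it's worth, peft can report how small the trainable part actually is with this config; a quick standalone sketch (same LoRA settings, on the 125m base):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
lora_model = get_peft_model(
    base,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
# Prints the trainable vs. total parameter counts; with r=16 on q_proj/v_proj
# well under 1% of the weights end up trainable.
lora_model.print_trainable_parameters()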

For inference, I put together this code:

import torch
# Quick sanity check that a GPU is visible.
print(torch.cuda.is_available())
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

def load_fine_tuned_model():
    model_id = "./opt-125m-fine-tuned/peft-model"
    # The adapter config records which base model it was trained on.
    config = PeftConfig.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        config.base_model_name_or_path, return_dict=True, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    # Load the saved LoRA weights on top of the base model.
    peft_model = PeftModel.from_pretrained(model, model_id)

    return peft_model, tokenizer

model, tokenizer = load_fine_tuned_model()

# Use your model as you would normally do ...
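
For example, generation with the fine-tuned adapter works like with any causal LM; a minimal sketch (the prompt is just a placeholder):

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(next(model.parameters()).device)
# PeftModel forwards generate() to the wrapped base model.
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))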

If anyone's wondering, the 20k steps took ~1 hour on an RTX 3090, and my dataset for this specific experiment consisted of ~1500 items. Still, the results were promising.

Tchagoue commented 6 months ago

Could you please tell me what your data looks like?