
cohere-finetune

Cohere-finetune is a tool that facilitates easy, efficient and high-quality fine-tuning of Cohere's models on users' own data to serve their own use cases.

Currently, we support the following base models for fine-tuning:

- Command R (command-r)
- Command R 08-2024 (command-r-08-2024)
- Command R Plus (command-r-plus)
- Command R Plus 08-2024 (command-r-plus-08-2024)
- Aya Expanse 8B (aya-expanse-8b)
- Aya Expanse 32B (aya-expanse-32b)

We also support any customized base model built on one of these supported models (see Step 4 for more details).

Currently, we support the following fine-tuning strategies:

- LoRA (low-rank adaptation)
- QLoRA (LoRA with 4-bit quantization of the base model)

We will keep extending the base models and fine-tuning strategies we support, and keep adding more features, to help our users fine-tune Cohere's models more easily, more efficiently and with higher quality.

1. Prerequisites

To help you decide what hardware resources you need, the table below lists some feasible scenarios as a reference. All hyperparameters not shown in the table are set to their default values (see the hyperparameter table in Step 4).

| Hardware resources | Base model | Finetune strategy | Batch size | Max sequence length |
|---|---|---|---|---|
| 8 * 80GB H100 GPUs | Command R, Command R 08-2024, Aya Expanse 8B, Aya Expanse 32B | LoRA or QLoRA | 8 | 16384 |
| 8 * 80GB H100 GPUs | Command R, Command R 08-2024, Aya Expanse 8B, Aya Expanse 32B | LoRA or QLoRA | 16 | 8192 |
| 8 * 80GB H100 GPUs | Command R Plus, Command R Plus 08-2024 | LoRA or QLoRA | 8 | 8192 |
| 8 * 80GB H100 GPUs | Command R Plus, Command R Plus 08-2024 | LoRA or QLoRA | 16 | 4096 |
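
To confirm what GPUs are actually visible on your host before you start, you can run, for example:

# List the GPUs and their memory available on the host
nvidia-smi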

2. Setup

Run the commands below on the GPU machine.

git clone git@github.com:cohere-ai/cohere-finetune.git
cd cohere-finetune

3. Fine-tuning

Throughout this section and the sections below, we use the notation <some_content_you_must_change> to denote some content that you must change according to your own use case, e.g., names, paths to files or directories, etc. Meanwhile, for any name or path that is not between the angle brackets, you must use it as it is, unless otherwise stated.

You can fine-tune a base model on your own data by following the steps below on the GPU machine (the host).

Step 1. Build the Docker image

Run the command below to build the Docker image; this may take about 18 minutes the first time you build it on the host.

DOCKER_BUILDKIT=1 docker build --rm \
    --ssh default \
    --target peft-prod \
    -t <peft_prod_docker_image_name> \
    -f docker/Dockerfile \
    .

Alternatively, you may directly use the image we built for you: skip this step and use our image name ghcr.io/cohere-ai/cohere-finetune:latest as <peft_prod_docker_image_name> in the next step.
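
For example, to pull the pre-built image ahead of time:

docker pull ghcr.io/cohere-ai/cohere-finetune:latest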

Step 2. Run the Docker container to start the fine-tuning service

Run the command below to start the fine-tuning service.

docker run -it --rm \
    --name <peft_prod_finetune_service_docker_container_name> \
    --gpus <gpus_accessible_by_the_container> \
    --ipc=host \
    --net=host \
    -v ~/.cache:/root/.cache \
    -v <finetune_root_dir>:/opt/finetuning \
    -e PATH_PREFIX=/opt/finetuning/<finetune_sub_dir> \
    -e ENVIRONMENT=DEV \
    -e TASK=FINETUNE \
    -e HF_TOKEN=<hf_token> \
    -e WANDB_API_KEY=<wandb_api_key> \
    <peft_prod_docker_image_name>

Some parameters are explained below:

If you want the service to run in the background, you can now detach from the Docker container.
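
For example, because the container is started with -it, you can detach from it with the Ctrl-p Ctrl-q key sequence and then manage it with standard Docker commands:

# Follow the fine-tuning service logs without re-attaching
docker logs -f <peft_prod_finetune_service_docker_container_name>

# Re-attach to the running container if needed
docker attach <peft_prod_finetune_service_docker_container_name>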

Step 3. Prepare the training and evaluation data

Put one and only one file, the training data, in the directory <finetune_root_dir>/<finetune_sub_dir>/<finetune_name>/input/data/training, where this file must be in one of the following formats:

Optionally, you can also put one and only one file as the evaluation data in the directory <finetune_root_dir>/<finetune_sub_dir>/<finetune_name>/input/data/evaluation, where this file must be in one of the three formats above. If you do not provide any evaluation data, we will split the provided training data into training and evaluation sets according to the hyperparameter eval_percentage (see Step 4 below).
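
For example, here is a minimal sketch of preparing the data directories on the host, assuming (hypothetically) that your files are named train.jsonl and eval.jsonl and are already in one of the supported formats:

# Create the expected input directories for this fine-tuning run
mkdir -p <finetune_root_dir>/<finetune_sub_dir>/<finetune_name>/input/data/training
mkdir -p <finetune_root_dir>/<finetune_sub_dir>/<finetune_name>/input/data/evaluation

# Place exactly one training file (and, optionally, exactly one evaluation file)
cp train.jsonl <finetune_root_dir>/<finetune_sub_dir>/<finetune_name>/input/data/training/
cp eval.jsonl <finetune_root_dir>/<finetune_sub_dir>/<finetune_name>/input/data/evaluation/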

Step 4. Submit the request to start the fine-tuning

Throughout this section and the sections below, we use cURL to send the requests, but you can also send them with Python's requests library or in any other way you like. Also, you can send the requests from the host where the service is running, or from any other machine, e.g., your laptop (as long as you run, e.g., ssh -L 5000:localhost:5000 -Nf <username>@<host_address> on that machine for local port forwarding).

Run the following command to submit a request to start the fine-tuning.

curl --request POST http://localhost:5000/finetune \
    --header "Content-Type: application/json" \
    --data '{
        "finetune_name": "<finetune_name>",
        "base_model_name_or_path": "command-r-08-2024",
        "parallel_strategy": "fsdp",
        "finetune_strategy": "lora",
        "use_4bit_quantization": "false",
        "gradient_checkpointing": "true",
        "gradient_accumulation_steps": 1,
        "train_epochs": 1,
        "train_batch_size": 16,
        "validation_batch_size": 16,
        "learning_rate": 1e-4,
        "eval_percentage": 0.2,
        "lora_config": {"rank": 8, "alpha": 16, "target_modules": ["q", "k", "v", "o"], "rslora": "true"},
        "wandb_config": {"project": "<wandb_project_name>", "run_id": "<wandb_run_name>"}
    }'

The <finetune_name> must be exactly the same as that used in Step 3. If you are not going to use Weights & Biases for logging during the fine-tuning, the hyperparameter "wandb_config" can be removed. See the table below for details about all the other hyperparameters we support. Some of the valid values or ranges are based on best practices; you do not have to strictly follow them, but if you deviate from them, some of the validation code also needs to be changed or removed.

| Hyperparameter | Definition | Default value | Valid values or range |
|---|---|---|---|
| base_model_name_or_path | The name of the base model or the path to the checkpoint of a customized base model | "command-r-08-2024" | "command-r", "command-r-08-2024", "command-r-plus", "command-r-plus-08-2024", "aya-expanse-8b", "aya-expanse-32b", or a path starting with "/opt/finetuning/" |
| parallel_strategy | The strategy to use multiple GPUs for training | "fsdp" | "vanilla", "fsdp", "deepspeed" |
| finetune_strategy | The strategy to train the model | "lora" | "lora" |
| use_4bit_quantization | Whether to apply 4-bit quantization to the model | "false" | "false", "true" |
| gradient_checkpointing | Whether to use gradient (activation) checkpointing | "true" | "false", "true" |
| gradient_accumulation_steps | The number of gradient accumulation steps | 1 | integers, min: 1 |
| train_epochs | The number of epochs to train | 1 | integers, min: 1, max: 10 |
| train_batch_size | The batch size during training | 16 | integers, min: 8, max: 32 |
| validation_batch_size | The batch size during validation (evaluation) | 16 | integers, min: 8, max: 32 |
| learning_rate | The learning rate | 1e-4 | real numbers, min: 5e-5, max: 0.1 |
| eval_percentage | The percentage of data split from the training data for evaluation (ignored if evaluation data are provided) | 0.2 | real numbers, min: 0.05, max: 0.5 |
| lora_config.rank | The rank parameter in LoRA | 8 | integers, min: 8, max: 16 |
| lora_config.alpha | The alpha parameter in LoRA | 2 * rank | integers, min: 16, max: 32 |
| lora_config.target_modules | The modules to apply LoRA on | ["q", "k", "v", "o"] | any non-empty subset of ["q", "k", "v", "o", "ffn_expansion"] |
| lora_config.rslora | Whether to use rank-stabilized LoRA (rsLoRA) | "true" | "false", "true" |

Note that you can set base_model_name_or_path as either the name of a supported model or the path to the checkpoint of a customized base model. However, if it is a path, the following requirements must be satisfied:

Also note that finetune_strategy = "lora", use_4bit_quantization = "false" corresponds to the fine-tuning strategy of LoRA, while finetune_strategy = "lora", use_4bit_quantization = "true" corresponds to the fine-tuning strategy of QLoRA.
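
For example, here is a minimal sketch of a QLoRA request, which simply flips use_4bit_quantization to "true". The other hyperparameters are omitted on the assumption that omitted fields fall back to the defaults in the table above; if in doubt, resend the full request from earlier in this step with only this field changed.

# Assumption: omitted hyperparameters fall back to their defaults (see the table above)
curl --request POST http://localhost:5000/finetune \
    --header "Content-Type: application/json" \
    --data '{
        "finetune_name": "<finetune_name>",
        "base_model_name_or_path": "command-r-08-2024",
        "finetune_strategy": "lora",
        "use_4bit_quantization": "true"
    }'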

After the fine-tuning is finished, you can find all the files about this fine-tuning in <finetune_root_dir>/<finetune_sub_dir>/<finetune_name>. More specifically, our fine-tuning service will automatically create the following folders for you:

At any time (before, during or after the fine-tuning), you can run the following command to check the status of the fine-tuning service.

curl --request GET http://localhost:5000/status
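
For example, a simple way to poll the status once a minute from a shell on the host:

# Poll the fine-tuning service status every 60 seconds (stop with Ctrl-C)
while true; do
    curl --silent --request GET http://localhost:5000/status
    echo
    sleep 60
done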

After you finish the current fine-tuning, you can do another fine-tuning (probably with different data and/or hyperparameters) by doing Step 3 and Step 4 again, but you must use a different <finetune_name>.

Step 5. Terminate the fine-tuning service

When you do not want to do any more fine-tunings, you can run the following command to terminate the fine-tuning service.

curl --request GET http://localhost:5000/terminate

Next steps

Now you have one or more fine-tuned models. If you want to deploy them in production and efficiently serve a large number of requests, here are your options:

4. Inference

We also provide a simple inference service to facilitate quick experiments or small-scale evaluations with the fine-tuned models, but this service should not be used for large-scale inference in production.

Step 1. Build the Docker image

Do Step 1 of the fine-tuning section above (or use our pre-built image) if you have not done so; otherwise, skip this step.

Step 2. Run the Docker container to start the inference service

Run the command below to start the inference service. This command is similar to the one in Step 2 of the fine-tuning section; the main difference is that you set the environment variable TASK=INFERENCE to indicate that you now want to do inference, not fine-tuning.

docker run -it --rm \
    --name <peft_prod_inference_service_docker_container_name> \
    --gpus <gpus_accessible_by_the_container> \
    --ipc=host \
    --net=host \
    -v ~/.cache:/root/.cache \
    -v <finetune_root_dir>:/opt/finetuning \
    -e ENVIRONMENT=DEV \
    -e TASK=INFERENCE \
    -e HF_TOKEN=<hf_token> \
    <peft_prod_docker_image_name>

Step 3. Submit the request to get model response

Run the following command to submit a request to the fine-tuned model and get its response. Note that this inference service is designed to be similar to Cohere's Chat API, and that the port for this inference service is 5001, not 5000.

curl --request POST http://localhost:5001/inference \
    --header "Content-Type: application/json" \
    --data '{
        "model_name_or_path": "<model_name_or_path>",
        "message": "<message>",
        "chat_history": <chat_history>,
        "preamble": "<preamble>",
        "max_new_tokens": 1024,
        "do_sample": "false"
    }'
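
For instance, here is a hypothetical filled-in request. The exact schema of <chat_history> is described with the other parameters below; since the service is designed to be similar to Cohere's Chat API, a list of role/message turns as sketched here is an assumed shape, so adjust it to match the parameter descriptions.

# NOTE: the chat_history schema and role names below are an assumption based on
# Cohere's Chat API style; <model_name_or_path> is still a placeholder you must fill in.
curl --request POST http://localhost:5001/inference \
    --header "Content-Type: application/json" \
    --data '{
        "model_name_or_path": "<model_name_or_path>",
        "message": "What is the capital of France?",
        "chat_history": [
            {"role": "User", "message": "Hi, can you help me with some geography questions?"},
            {"role": "Chatbot", "message": "Of course, happy to help."}
        ],
        "preamble": "You are a concise and helpful assistant.",
        "max_new_tokens": 1024,
        "do_sample": "false"
    }'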

The parameters are explained below.

A caveat is that if the inference service finds that the model you want to use differs from the model it is currently holding, it will spend some time loading the requested model. Therefore, avoid switching models frequently during inference; finish all the inferences with one model before switching to another.

You can also run the following command to get some information about the inference service, e.g., the model it is currently holding.

curl --request GET http://localhost:5001/info

Step 4. Terminate the inference service

When you do not want to do any more inferences, you can run the following command to terminate the inference service.

curl --request GET http://localhost:5001/terminate

5. Development

If you want to write and run your own code for fine-tuning, e.g., in a Jupyter notebook, you can use our tool in a development mode that gives you more flexibility and more control over fine-tuning.

Step 1. Build the Docker image for development

Run the command below to build the Docker image; this may take about 18 minutes the first time you build it on the host.

DOCKER_BUILDKIT=1 docker build --rm \
    --ssh default \
    --target peft-dev \
    -t <peft_dev_docker_image_name> \
    -f docker/Dockerfile \
    .

You can also edit the peft-dev stage in docker/Dockerfile to install any apps you need for development, and build your own Docker image for development.

Step 2. Run the Docker container for development

Run the command below to enter the container for development. You can mount any directory <dir_on_host> on the host to a directory <dir_in_container> in the container, and you can add as many mounts as you want. For example, it can be helpful to add -v ~/.ssh:/root/.ssh if you want to use your host's ~/.ssh in the container.

docker run -it --rm \
    --name <peft_dev_docker_container_name> \
    --gpus <gpus_accessible_by_the_container> \
    --entrypoint=bash \
    --ipc=host \
    --net=host \
    -v ~/.cache:/root/.cache \
    -v <dir_on_host>:<dir_in_container> \
    -e HF_TOKEN=<hf_token> \
    -e WANDB_API_KEY=<wandb_api_key> \
    <peft_dev_docker_image_name>

Now you can start your development work and do anything you want there.
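
For example, here is a minimal sketch of starting a Jupyter server inside the development container. It assumes JupyterLab is not pre-installed in the image (so it installs it first) and relies on the --net=host flag above so that the server is reachable directly on the host.

# Assumption: JupyterLab is not pre-installed in the peft-dev image, so install it first
pip install jupyterlab

# Start the server; with --net=host it is reachable at http://localhost:8888 on the host
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root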