This is a Python project template to easily run experiments on Compute Canada clusters.
Experiment parameters can be set with strongly-typed structured configurations using Hydra and overridden from the command line.
Jobs on the SLURM cluster are started from Python with submitit, which enables automated hyperparameter tuning with nevergrad.
The experiment metrics can be tracked on comet.
As an example, this template shows how to fine-tune Mistral 7B on the OpenHermes-2.5 dataset using QLoRA. See Example Usage.
This section describes the structure of CoCPyT.
├── README.md # This file
├── config.sh # Configuration related to Compute Canada
├── create_venv.sh # Create a python virtual environment archive
├── data # Contains the training data
├── experiments # Contains the different experiments
│ ├── abrasive_reef_4045 # Each experiment has a random name
│ │ └── checkpoints # Checkpoints stored during training
│ ├── angular_bison_3657
│ ├── buoyant_pancake_3541
├── main.py # Main python file to start experiments
├── model # Model weights
├── run_on_cluster.sh # Run main.py on the computing cluster
├── setup_environment.sh # Setup the virtual environment on the compute node
├── src # Project sources
│ ├── __init__.py # Define a python module
│ ├── callbacks.py # Training callbacks
│ ├── config.py # Experiment structured configuration
│ ├── data.py # Dataloader
│ └── train.py # Trainer
└── words.csv # List of words to use for experiment names
This structure can of course be adapted to suit your needs. Here is a rundown of the important files:
main.py
contains the code to start new jobs on the SLURM cluster. It can start a single training job, or multiple jobs for a hyperparameter search. For instance, running ./main.py +action=train
will call the train
function of train.py
. You can easily add more actions (for instance test
) to this file.
train.py
You should modify this file to train the model you need.
config.py
This file contains the default configuration of your experiments. Every configuration defined here can be accessed through Hydra's config object cfg
in your code. Parameters can also be easily overridden on the command line, and you can define different presets. Please read the Example Usage section and Hydra's documentation for more info.
config.sh
This file contains cluster-related configuration, for instance where to store your project files on the node or which modules to load.
This section explains how to set up your cluster for experiments.
Since April 2024, Compute Canada uses multifactor authentication. Because the run_on_cluster
script automatically opens SSH connections, you can use a persistent SSH connection to avoid being asked for the second-factor code multiple times.
For instance, to only be asked for your code once a day, edit ~/.ssh/config
and add the following lines. Replace your_username
with your username on the cluster. This example is for beluga, but you can use any other cluster.
Host beluga
Hostname beluga.alliancecan.ca
User your_username
ControlPath ~/.ssh/cm-%r@%h:%p
ControlMaster auto
ControlPersist 1d
To use comet and see your training metrics graphed in real time, you must add your comet API key to ~/.comet.config
on the cluster. To get an API key, create a comet account and go to your account settings. They offer free academic accounts here: https://www.comet.com/signup?plan=academic
[comet]
api_key=YOUR_COMET_API_KEY
You can add this file to your home directory on your own machine as well if you want to also log your local runs on comet.
If you don't want to use comet, you can add ~comet
to your experiment parameters. If you'd like to add support for an open-source alternative to comet, please make a PR!
In order to run your Python project on Compute Canada, you need to configure the right modules and identify the correct requirements. See here for more information about Python on Compute Canada.
Python versions and certain dependencies like cuda are managed by modules on Compute Canada. The modules can be configured in config.sh
. For instance, the default modules below are suitable for running a machine learning project with cuda and Python 3.11.5. The httpproxy
module is required for comet. You can find a list of available modules here.
MODULES="StdEnv/2023 python/3.11.5 arrow/15.0.1 cuda/12.2 httpproxy/1.0"
The requirements_cluster.txt
file contains the required Python dependencies for your project and must be crafted to work with the particular cluster you're using, because not all of the usual Python packages may be available there.
It can be a bit difficult to find the right versions of the Python modules that are compatible with the cluster, and those versions will most likely differ from your local Python installation because Compute Canada clusters use a custom wheelhouse. Here's a general method to build the requirements_cluster.txt
file.
First, synchronize your code to the cluster with the run_on_cluster
script:
./run_on_cluster.sh --sync-only beluga
Then connect to the cluster, load the modules, and create a virtual environment in /tmp
:
ssh beluga
cd ~/scratch/CoCPyT
source config.sh
module load $MODULES
virtualenv --no-download /tmp/$USER/venv
source /tmp/$USER/venv/bin/activate
python3 main.py "~slurm" +action=train
It will likely fail and complain about a missing wheel. For instance:
ModuleNotFoundError: No module named 'submitit'
Install the missing module with pip install --no-index
, or contact Compute Canada technical support and ask them to add the wheel. Note that some wheels like pyarrow
are only available through modules, so you need to make sure they don't end up in your requirements_cluster.txt
:
pip install --no-index submitit
Once main.py runs without missing modules, generate the requirements_cluster.txt
file:
pip freeze > requirements_cluster.txt
The requirements_cluster.txt
file is used to generate a Python virtual environment containing all of the required packages. This environment is gzipped so it can be quickly transferred to the compute nodes. This method ensures that tasks spawn very quickly by avoiding reinstalling the environment every time. Since virtual environments are not relocatable, we build and extract them in /tmp
, a path that is accessible on both the login and compute nodes.
To generate the virtual environment, run ./create_venv.sh
on the login node. The resulting venv.tar.gz
will be automatically copied to the compute node and extracted when starting a task. You need to run that script again each time you modify the requirements.
The parameters for your experiments are defined in src/config.py
and are structured and strongly-typed. Read the Hydra documentation to learn how to modify this file and add your own parameters. You can also define presets, for instance to quickly switch between different models.
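Structured configs in Hydra are plain Python dataclasses registered with a ConfigStore. As a hedged sketch of the pattern (the field names batch_size and n_epochs appear in the override examples in this README; the rest is illustrative, not the template's actual schema):

```python
from dataclasses import dataclass, field


@dataclass
class DataConfig:
    batch_size: int = 16          # illustrative default


@dataclass
class TrainConfig:
    n_epochs: int = 10            # illustrative default
    lr: float = 2e-4


@dataclass
class Config:
    data: DataConfig = field(default_factory=DataConfig)
    train: TrainConfig = field(default_factory=TrainConfig)


# With hydra-core installed, the schema would be registered so that
# the cfg object is validated against these types, roughly:
#   from hydra.core.config_store import ConfigStore
#   ConfigStore.instance().store(name="base", node=Config)
```

A command-line override such as data.batch_size=32 is then type-checked against the corresponding dataclass field instead of being treated as an arbitrary string.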
The experiment results and checkpoints are stored in the experiments
folder.
To use main.py
locally, please specify an action (i.e. train
) and preset (i.e. base
) using:
python3 main.py +action=[action] +preset=[preset]
Hydra is used to override parameters directly from the command line.
For instance, you can override batch_size and n_epochs with:
python3 main.py +action=train +preset=base data.batch_size=16 train.n_epochs=10
To disable submitit and/or comet you can use the ~
operator:
python3 main.py +action=train +preset=base "~comet" "~slurm"
First, set the REPO_NAME
in config.sh
to the name of your project.
Then start the training with:
./run_on_cluster.sh [host] <params>
Where <params>
are the parameters to pass to main.py
(see Training locally)
The code for the current branch will be synchronized to the REPO_PATH
directory on the cluster and the params
will be passed to main.py
on the login node of the cluster. submitit
is used in main.py
to schedule and run the jobs with SLURM and gather the results. The logs of the runs are stored in LOG_PATH
.
A snapshot of the git repository is taken when the job starts, so it is possible to start multiple concurrent jobs with different versions of the code.
The REPO_PATH
variable can be used to specify an alternate location to store and run the code.
If running locally (i.e. not on a SLURM cluster node), submitit
won't try to use SLURM, so you can use python3 main.py [param]
as usual on your computer.
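Conceptually, the local-versus-SLURM decision resembles the sketch below. submitit's actual detection logic is more involved; this helper is purely illustrative:

```python
import os
import shutil


def execution_backend() -> str:
    # If SLURM's submission tool is on PATH, or we are already inside
    # a SLURM job, jobs can go through the scheduler; otherwise
    # everything runs in the local process.
    if "SLURM_JOB_ID" in os.environ or shutil.which("sbatch"):
        return "slurm"
    return "local"
```

This is why the same main.py invocation works unchanged on your laptop and on the cluster's login node.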
If you want to run your job in an interactive salloc
session, add ~slurm
to the parameters of main.py
to avoid submitting the job.
Follow these instructions to fine-tune Mistral on beluga.
git clone https://github.com/g33kex/CoCPyT
./run_on_cluster.sh --sync-only beluga
Create the virtual environment using the requirements_cluster.txt
file that has been generated for this example:
./create_venv.sh
Then download the model weights and the dataset:
module load python
pip install --no-index huggingface_hub
huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir model
huggingface-cli download --repo-type dataset teknium/OpenHermes-2.5 --local-dir data
./run_on_cluster.sh beluga +action=train +preset=base
Copyright (C) 2024 g33kex, Kkameleon, ran-ya, yberreby
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
We are not affiliated with the Digital Research Alliance of Canada.