Paper link: [arXiv](https://arxiv.org/abs/2407.01445)
We explore several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques and designed and optimized for the distributed setting. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rule of the temperature parameter, and the update rule of the model parameters. Finally, we benchmark the performance of FastCLIP and OpenCLIP on different compute scales (up to 32 GPUs on 8 nodes) and three data scales (CC3M, CC12M, and LAION400M).
The FastCLIP framework is an efficient distributed training framework for CLIP models, comprising a family of algorithms that differ in how the loss is computed and how the temperature parameter is updated. It is powered by advanced finite-sum coupled compositional optimization (FCCO) [4] techniques for optimizing global contrastive losses, which makes it suitable for training CLIP models with limited compute resources. Below we introduce three algorithms in this framework, named FastCLIP-v1 to v3; the full pseudo-code of FastCLIP is given as Algorithm 1 in the paper.
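As a stand-in for Algorithm 1, the following is a minimal single-GPU sketch of the core FCCO-style update. The names (`fastclip_step`, the estimator buffer `u`) are ours, not the paper's; the actual algorithm additionally handles distributed feature gathering and the temperature update:

```python
import torch

def fastclip_step(img_emb, txt_emb, idx, u, gamma, tau, eps=1e-8):
    """One compositional update on a batch (illustrative sketch, B >= 2).

    img_emb, txt_emb: (B, d) L2-normalized embeddings of the batch.
    idx: (B,) dataset indices of the batch samples.
    u: (n, 2) buffer of inner estimators (image->text, text->image).
    gamma: inner learning rate; tau: temperature.
    """
    B = img_emb.shape[0]
    sim = img_emb @ txt_emb.T                      # (B, B) pairwise similarities
    diag = sim.diag().view(-1, 1)                  # positive-pair similarities
    mask = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    # exponentiated (negative - positive) margins, averaged over in-batch negatives
    g_i2t = (torch.exp((sim - diag) / tau) * mask).sum(1) / (B - 1)
    g_t2i = (torch.exp((sim - diag.T) / tau) * mask).sum(0) / (B - 1)
    # moving-average update of the inner estimators with inner learning rate gamma
    u[idx, 0] = (1 - gamma) * u[idx, 0] + gamma * g_i2t.detach()
    u[idx, 1] = (1 - gamma) * u[idx, 1] + gamma * g_t2i.detach()
    # surrogate loss whose gradient matches that of tau * log(eps + g), with the
    # inner sum estimated by u over the whole dataset rather than the batch alone
    loss = (tau * (g_i2t / (eps + u[idx, 0]) + g_t2i / (eps + u[idx, 1]))).mean()
    return loss
```

With `gamma = 1` the estimator `u` collapses to the current mini-batch average, essentially recovering the mini-batch contrastive gradient; a smaller `gamma` is what lets the FCCO-based updates tolerate small batches.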
The three algorithms differ in how they compute the loss and how they update the temperature $\tau$ (Line 9 and Line 11 in Algorithm 1). Specifically, FastCLIP-v1 optimizes the Global Contrastive Loss (GCL), which was first used by SogCLR [1] and contrasts each anchor against all samples in the dataset rather than only those in the current mini-batch, with a small constant $\varepsilon$ added inside the logarithm.
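For concreteness, GCL has the following shape in illustrative notation (a sketch: $h_{i,j}$ denotes the similarity between image $i$ and text $j$, and $n$ is the dataset size; the paper's exact formulation may differ):

$$F(\mathbf{w}) = \frac{\tau}{2n} \sum_{i=1}^{n} \left[ \log\Big(\varepsilon + \frac{1}{n-1}\sum_{j \neq i} e^{(h_{i,j} - h_{i,i})/\tau}\Big) + \log\Big(\varepsilon + \frac{1}{n-1}\sum_{j \neq i} e^{(h_{j,i} - h_{i,i})/\tau}\Big) \right]$$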
For the $\tau$ update, FastCLIP-v1 keeps it at a constant value, as in SogCLR. FastCLIP-v1 differs from SogCLR in the schedule of the inner learning rate $\gamma$ in Eqn. (2) of the algorithm: SogCLR sets $\gamma_t$ to a constant, while FastCLIP-v1 updates it using a cosine decay schedule. Let $E_{\mathrm{cur}}$ denote the current epoch and $E$ the number of decay epochs; then $\gamma_t$ decays from 1.0 to its target value over the first $E$ epochs, as in the sketch below.
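A minimal sketch of this schedule, assuming (consistent with the `--gamma` and `--gamma_decay_epochs` options described later) that $\gamma$ decays from 1.0 to a floor `gamma_min` and stays there afterwards:

```python
import math

def gamma_schedule(e_cur: int, e_decay: int, gamma_min: float) -> float:
    """Cosine decay of the inner learning rate gamma (illustrative sketch)."""
    if e_cur >= e_decay:
        return gamma_min  # after the decay phase, gamma stays at its floor
    # half-cosine interpolation from 1.0 down to gamma_min
    return gamma_min + 0.5 * (1.0 - gamma_min) * (1.0 + math.cos(math.pi * e_cur / e_decay))
```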
FastCLIP-v2 optimizes the Robust Global Contrastive Loss (RGCL), which was first used by iSogCLR [2]. In RGCL, the temperature becomes a variable that needs to be optimized. Moreover, each data point now has its own individual temperature, as opposed to the global temperature in GCL. Let $\tau_1 = (\tau_{1,1}, \ldots, \tau_{1,n})$ and $\tau_2 = (\tau_{2,1}, \ldots, \tau_{2,n})$ denote the individual temperatures of the two modalities; RGCL weights each log term by its own temperature and adds the regularization $\rho\,(\tau_{1,i} + \tau_{2,i})$, subject to each temperature staying above $\tau_0$.
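In the same illustrative notation as the GCL sketch above (following iSogCLR; the paper's exact formulation may differ):

$$\min_{\mathbf{w},\; \tau_1, \tau_2 \geq \tau_0}\; \frac{1}{2n} \sum_{i=1}^{n} \left[ \tau_{1,i} \log\Big(\varepsilon + \frac{1}{n-1}\sum_{j \neq i} e^{(h_{i,j} - h_{i,i})/\tau_{1,i}}\Big) + \tau_{2,i} \log\Big(\varepsilon + \frac{1}{n-1}\sum_{j \neq i} e^{(h_{j,i} - h_{i,i})/\tau_{2,i}}\Big) + \rho\,(\tau_{1,i} + \tau_{2,i}) \right]$$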
Here $\tau_0$ is a small value and $\rho \geq 0$ is a hyperparameter. Similarly to v1, the difference between FastCLIP-v2 and iSogCLR lies in the schedule of the inner learning rate $\gamma$: the former leverages the cosine schedule while the latter uses the constant schedule. FastCLIP-v3 optimizes a variant of RGCL which we name RGCL with global temperature (RGCL-g), sketched below.
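Under the same illustrative sketch, tying all individual temperatures to one global $\tau$ turns RGCL into RGCL-g (again, the exact normalization of the regularizer is our assumption):

$$\min_{\mathbf{w},\; \tau \geq \tau_0}\; \frac{\tau}{2n} \sum_{i=1}^{n} \left[ \log\Big(\varepsilon + \frac{1}{n-1}\sum_{j \neq i} e^{(h_{i,j} - h_{i,i})/\tau}\Big) + \log\Big(\varepsilon + \frac{1}{n-1}\sum_{j \neq i} e^{(h_{j,i} - h_{i,i})/\tau}\Big) \right] + \rho\,\tau$$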
The main difference between RGCL and RGCL-g is that RGCL-g unifies the individual temperature parameters into a single global temperature, which in FastCLIP-v3 is also learnable. The following table compares the different algorithms:

| Algorithm | Loss | FCCO | Distributed | Temperature Scheme |
| --- | --- | --- | --- | --- |
| OpenCLIP [3] | MBCL | No | Yes | G, learnable |
| SogCLR [1] | GCL | Yes | No | G, constant |
| iSogCLR [2] | RGCL | Yes | No | I, learnable |
| FastCLIP-v1 | GCL | Yes | Yes | G, constant |
| FastCLIP-v2 | RGCL | Yes | Yes | I, learnable |
| FastCLIP-v3 | RGCL-g | Yes | Yes | G, learnable |

Note that OpenCLIP [3] uses the Mini-Batch Contrastive Loss (MBCL), which requires a large batch size for good performance. "FCCO" means the algorithm leverages finite-sum coupled compositional optimization techniques. "Distributed" means the algorithm is designed for distributed training. In "Temperature Scheme", "G" means global temperature while "I" means individual temperature.
Next we present the results of FastCLIP vs. OpenCLIP, SogCLR and iSogCLR. For more results please refer to our paper.
FastCLIP vs. OpenCLIP: We plot the ImageNet Top-1 accuracy curves of OpenCLIP and FastCLIP-v3 in the xlarge-scale setting (LAION400M, batch size 5120) in subfigure (a), and the average performance on ImageNet and its variants across different numbers of nodes in the medium-scale (CC3M, batch size 1024) and large-scale (CC12M, batch size 2048) settings in subfigures (b) and (c), respectively.
We also plot the training curves of the average performance on ImageNet and its variants for OpenCLIP and FastCLIP-v3 in the medium-scale and large-scale settings (corresponding to subfigures (b) and (c) above).
FastCLIP vs. SogCLR, iSogCLR: The following table shows the results of FastCLIP-v1 (FastCLIP-v2, resp.) vs. SogCLR (iSogCLR, resp.) in the medium-scale and large-scale settings.
The following figure shows the training time in the medium-scale and large-scale settings. Subfigures (a) and (b) plot the per-iteration training time. Subfigures (c) and (d) plot the communication time per iteration.
[1] Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25760–25782. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/yuan22b.html.
[2] Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, and Tianbao Yang. Not all semantics are created equal: Contrastive self-supervised learning with automatic temperature individualization. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28389–28421. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/qiu23a.html.
[3] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. https://doi.org/10.5281/zenodo.5143773, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
[4] Bokun Wang and Tianbao Yang. Finite-sum coupled compositional stochastic optimization: Theory and applications. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23292–23317. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/wang22ak.html.
To set up the environment for training, please run:
git clone https://github.com/Optimization-AI/fast_clip.git
cd fast_clip
conda create -n fastclip python=3.11
conda activate fastclip
pip install -r requirements-training.txt
We present sample slurm scripts to run OpenCLIP and FastCLIP-v0 to v3. For non-slurm instructions, please refer to the end of this subsection. To train on your own data, you need to modify the following options:

- `--train-data`: the path to the training data; currently only the webdataset format is supported.
- `--train-num-samples`: the number of samples seen in one epoch; we recommend setting it to the actual size of the dataset.
- `--data_size`: the original size of the dataset, which may differ from `--train-num-samples`. In the case of CC3M, its metadata contains 3318333 image-URL/caption pairs, but we were only able to download 2723840 of them, so we set `--data_size` to 3318333 and `--train-num-samples` to 2723840.
- `--epochs`: the number of epochs to train the model for.
- `--gamma_decay_epochs`: the number of epochs over which $\gamma$ decreases from 1.0 to `--gamma`. We recommend setting it to half of `--epochs`.
Non-slurm Training: For non-slurm training, please set `master_addr` manually (e.g., `127.0.0.1`), change `srun python -u src/training/main.py` to `cd src && torchrun --nproc_per_node=4 --rdzv_endpoint=$master_addr -m training.main`, and run the above script with `/bin/bash`.
Evaluation: We provide a sample slurm script to evaluate a trained model (specified by `--resume`) on ImageNet-1k.

Datacomp: For evaluation on the Datacomp benchmark, please refer to the "Evaluation" section in the Datacomp repository.
Non-slurm Evaluation: For non-slurm evaluation, please set `master_addr` manually (e.g., `127.0.0.1`), change `srun python -u src/training/main.py` to `python src/training/main.py`, and run the above script with `/bin/bash`.
If you find FastCLIP useful in your research, please consider citing the following paper:
@article{wei2024fastclip,
title={FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources},
author={Wei, Xiyuan and Ye, Fanjiang and Yonay, Ori and Chen, Xingyu and Sun, Baixi and Tao, Dingwen and Yang, Tianbao},
journal={arXiv preprint arXiv:2407.01445},
year={2024}
}