CGX

CGX is a pytorch extension that adds a torch.distributed backend with support for allreduce over quantized buffers. It supports quantization of float16 and float32 tensors to 1-8 bits.

CGX is based on the MPI torch.distributed backend; the extension essentially replaces only the allreduce primitive.

Quick Start

Prerequisites

CGX, as a pytorch extension, requires pytorch>=1.10.0.

For a faster build, we recommend installing ninja (pip install ninja).

Compression is supported only for GPU-based buffers, so either CUDA or ROCm is required. If CUDA or ROCm is installed in a non-standard path, set [CUDA|ROCM]_HOME or [CUDA|ROCM]_PATH accordingly.

Since CGX is based on MPI, it requires OpenMPI built with GPU support (other MPI implementations have not been tested). The library also supports NCCL-based communication, so the NVIDIA NCCL library is required.
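
A quick way to sanity-check the Python-visible parts of these prerequisites (a minimal sketch; the MPI and NCCL installations still have to be verified separately):

import torch

# pytorch>=1.10.0 is required; "1.13.1+cu117".split("+")[0] -> "1.13.1"
version = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert version >= (1, 10), "CGX requires pytorch>=1.10.0"

# Compression runs only on GPU buffers, so a CUDA or ROCm device must be visible.
assert torch.cuda.is_available(), "no CUDA/ROCm device visible"
print(torch.__version__, torch.cuda.get_device_name(0))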

Install

export MPI_HOME=/path/to/mpi
export NCCL_HOME=/path/to/nccl
pip install pytorch-cgx
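
To verify the installation, importing the extension in the same environment should succeed (a minimal check; the compressed allreduce path itself is only exercised under mpirun, see Usage below):

import torch
import torch.distributed as dist
import torch_cgx  # importing the built extension makes the "cgx" backend available

print("torch_cgx imported successfully")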

Build from source

Set the MPI_HOME environment variable to your MPI installation path. For AMD GPUs, set CGX_CUDA to 0. Set the NCCL_HOME environment variable to the NCCL installation path, or set NCCL_INCLUDE and NCCL_LIB directly. Set QSGD_DETERMENISTIC=0 if you want the stochastic version of QSGD.

git clone https://github.com/IST-DASLab/torch_cgx
cd torch_cgx
export MPI_HOME=/path/to/mpi
export NCCL_HOME=/path/to/nccl
python setup.py install

Usage

The only changes required in a training script that uses pytorch distributed are importing the built extension and passing cgx as the backend parameter of torch.distributed.init_process_group.

Example:

import torch
import torch.distributed as dist
import torch_cgx

dist.init_process_group('cgx', init_method='env://', rank=args.local_rank)
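
Once the process group is initialized, collectives on GPU tensors go through CGX's allreduce; for example (a small sketch, assuming the script is launched as described below and each rank is assigned a GPU):

torch.cuda.set_device(args.local_rank)
t = torch.ones(1024, device='cuda') * dist.get_rank()
dist.all_reduce(t)  # summed across ranks by CGX's allreduce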

Also, in order to perform layerwise compression and to filter out small layers that are sensitive to gradient compression (typically batch-norm layers and biases), CGX needs information about the model. For that, users need to register the communication hook. The minimal size of the layers that get compressed can be controlled with the layer_min_size parameter.

from cgx_utils import cgx_hook, CGXState

model = torch.nn.parallel.DistributedDataParallel(...)
state = CGXState(torch.distributed.group.WORLD, layer_min_size=1024,
                 compression_params={"bits": args.quantization_bits,
                                     "bucket_size": args.quantization_bucket_size})
model.register_comm_hook(state, cgx_hook)

Since the extension is based on the MPI backend, it requires an MPI-compliant launcher (torch.distributed.launch won't work): $ mpirun -np 2 python train.py

Also, if your training script was previously run with the torch.distributed.launch utility, you need to set the following environment variables yourself when using the MPI launcher (see cifar_train.py in examples):

if "OMPI_COMM_WORLD_SIZE" in os.environ:
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '4040'
    os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    os.environ["RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]

Tuning

CGX can be tuned with the following environment variables:

Examples

Basic examples are provided under the example folder.

Notes