AI4Finance-Foundation / RLSolver

Solvers for NP-hard and NP-complete problems with an emphasis on high-performance GPU computing.
https://ai4finance.org
MIT License

✨ DataParallel and DistributedDataParallel for speeding up training #43

Open · Yonv1943 opened this issue 1 year ago

Yonv1943 commented 1 year ago

DataParallel: multi-threaded; for multiple GPUs on a single machine.

DistributedDataParallel: multi-process; for multiple GPUs on one or more machines.

It is very easy to add DataParallel to the code, but it brings only a modest speedup.

DistributedDataParallel is a little trickier to use because it has to be launched from the command line, but it gives a significant speedup with 4 GPUs on a single machine under high GPU memory usage.

Yonv1943 commented 1 year ago

Code

Here is the code (rename the attached .txt file to .py):

The kernel code of `nn.DataParallel`:

```python
import torch
import torch.nn as nn

gpu_ids = (4, 5, 6, 7)
gpu_id = gpu_ids[0]  # the primary GPU that holds the model and gathers the outputs
device = torch.device(f"cuda:{gpu_id}" if (torch.cuda.is_available() and (gpu_id >= 0)) else "cpu")
...
model = nn.DataParallel(model, device_ids=gpu_ids)  # replicate the model onto gpu_ids at every forward pass
...
data = data.to(device)      # inputs go to the primary device; DataParallel scatters them across gpu_ids
target = target.to(device)
```
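For context, a self-contained sketch of the same pattern (the model, data, and GPU ids here are placeholders, not the attached demo):

```python
import torch
import torch.nn as nn

gpu_ids = (0, 1)  # assumed: at least two visible GPUs
device = torch.device(f"cuda:{gpu_ids[0]}" if torch.cuda.is_available() else "cpu")

model = nn.Linear(28 ** 2, 10).to(device)  # parameters must live on the primary device, gpu_ids[0]
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=list(gpu_ids))

data = torch.rand(256, 28 ** 2).to(device)        # the full batch sits on the primary device;
target = torch.randint(0, 10, (256,)).to(device)  # DataParallel splits it across gpu_ids each forward pass

loss = nn.functional.cross_entropy(model(data), target)
loss.backward()
```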

The kernel code of `torch.nn.parallel.DistributedDataParallel`:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

'''DistributedDataParallel'''
world_size = dist.get_world_size()  # num_global_GPUs = num_machines * num_GPUs_per_machine
rank_id = dist.get_rank()           # the global rank of this process (one process per GPU)
device = torch.device(f"cuda:{rank_id}" if torch.cuda.is_available() else "cpu")
...
model = DistributedDataParallel(model, device_ids=[rank_id])  # gradients are all-reduced across processes
...
dist_inp = data[rank_id::world_size].to(device)    # each rank takes its own slice of the batch
dist_lab = target[rank_id::world_size].to(device)
...
if __name__ == '__main__':
    dist.init_process_group(backend='nccl', init_method='env://')  # backend in {'nccl', 'gloo', 'mpi'}
    run()  # the training entry defined elsewhere in the demo
    dist.destroy_process_group()
```
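Putting the pieces together, a minimal self-contained script (a sketch only; the model and data are placeholders, not the attached demo) that can be launched with the `torch.distributed.run` command shown under "Run it:" below:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def run():
    world_size = dist.get_world_size()
    rank_id = dist.get_rank()
    device = torch.device(f"cuda:{rank_id}")

    model = nn.Linear(28 ** 2, 10).to(device)                    # placeholder model
    model = DistributedDataParallel(model, device_ids=[rank_id])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()

    data = torch.rand(1024, 28 ** 2)                             # placeholder data
    target = torch.randint(0, 10, (1024,))
    dist_inp = data[rank_id::world_size].to(device)              # each rank takes its own slice
    dist_lab = target[rank_id::world_size].to(device)

    for _ in range(10):
        optimizer.zero_grad()
        criterion(model(dist_inp), dist_lab).backward()          # gradients are all-reduced by DDP
        optimizer.step()


if __name__ == '__main__':
    torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))  # set by torch.distributed.run
    dist.init_process_group(backend='nccl', init_method='env://')
    run()
    dist.destroy_process_group()
```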

The communication backend used by the current process:

> Backends: torch.distributed supports three built-in backends, each with different capabilities. The table in the documentation shows which functions are available for use with CPU / CUDA tensors. MPI supports CUDA only if the implementation used to build PyTorch supports it.

https://pytorch.org/docs/stable/distributed.html (see the NCCL / Gloo / MPI backend comparison table)
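In practice the choice usually reduces to the following (a minimal sketch, not from the original post): NCCL for CUDA tensors, Gloo as the CPU fallback.

```python
import torch
import torch.distributed as dist

# NCCL only handles CUDA tensors; Gloo also works with CPU tensors.
backend = 'nccl' if torch.cuda.is_available() else 'gloo'
dist.init_process_group(backend=backend, init_method='env://')
```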

Run it:

```
CUDA_VISIBLE_DEVICES="4,5,6,7" OMP_NUM_THREADS=4 \
python -m torch.distributed.run --nproc_per_node 4 \
    DEMO_torch_nn_DistDataParallel.py
```


### Why we use `torch.distributed.run` instead of `torch.distributed.launch`

Launching with `torch.distributed.launch` prints the following warning (note the "Use torch.distributed.run" part):

> FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
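As the warning says, `torch.distributed.run` passes the local rank through the environment instead of a `--local_rank` argument; a minimal sketch of reading it:

```python
import os
import torch
import torch.distributed as dist

# torch.distributed.run exports LOCAL_RANK (and RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT)
# to every worker process, so no --local_rank argument parsing is needed.
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl', init_method='env://')
```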


### It is possible to run DistributedDataParallel using PyTorch's `mp.spawn` instead of `python -m torch.distributed.run`

See the discussion "Why using mp.spawn is slower than using torch.distributed.launch when using multi-GPU training":

https://github.com/pytorch/pytorch/issues/47587
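For reference, a rough sketch of the `mp.spawn` launch path (the training body is elided; the function name mirrors the demo posted in the next comment):

```python
import torch.multiprocessing as mp


def run__train_in_fashion_mnist(rank_id: int, world_size: int):
    ...  # init the process group, wrap the model in DistributedDataParallel, train


if __name__ == '__main__':
    world_size = 4
    # mp.spawn injects the process index as the first argument of the target function,
    # so only the remaining arguments are passed through `args`.
    mp.spawn(fn=run__train_in_fashion_mnist, args=(world_size,), nprocs=world_size, join=True)
```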

## Output

### The running output of DataParallel:

Column legend: `if_full_model`, `if_data_parallel`, `if_save_data_in_gpu`, `TimeUsed`, GPU memory usage of the main thread, GPU memory usage of the sub threads.

Experiment: impact of turning DataParallel on or off with multiple GPU cards.

| if_full_model | if_data_parallel | if_save_data_in_gpu | TimeUsed | GPU_max_memo (main) | GPU_max_memo (sub) |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 68 | 0 |
| 0 | 1 | 1 | 12 | 68 | 3 |

Experiment: not putting the data in GPU memory.
Conclusion: not putting the data in GPU memory does reduce the memory footprint, but it increases the training time.

| if_full_model | if_data_parallel | if_save_data_in_gpu | TimeUsed | GPU_max_memo (main) | GPU_max_memo (sub) |
|---|---|---|---|---|---|
| 0 | 1 | 0 | ERROR | - | - |
| 1 | 1 | 0 | 19 | 4 | 3 |
| 1 | 0 | 0 | 13 | 13 | 0 |

(The ERROR row: putting the data on the CPU without turning on `full_model` reports an error.)

Question: `full_model` has to be used, so does it affect training?
Conclusion: whether `full_model` is turned on or not, the impact on training time and memory usage is too small to notice.

Experiment: effect of turning `full_model` on or off with a single card.

| if_full_model | if_data_parallel | if_save_data_in_gpu | TimeUsed | GPU_max_memo (main) | GPU_max_memo (sub) |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 68 | 0 |
| 1 | 0 | 1 | 1 | 68 | 0 |

Experiment: effect of turning `full_model` on or off with multiple cards.

| if_full_model | if_data_parallel | if_save_data_in_gpu | TimeUsed | GPU_max_memo (main) | GPU_max_memo (sub) |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 12 | 68 | 3 |
| 1 | 1 | 1 | 12 | 68 | 3 |


### The running output of DistributedDataParallel:

I use the Fashion-MNIST data in this demo.
In order to simulate training with large data and a large neural network, I use `data_repeat` to enlarge the MNIST data.

`data_repeat=N` means repeating the flattened MNIST image from `28**2` features to `28**2 * N` features. This operation enlarges both the input data and the MLP network (the input dimension becomes `28**2 * N`).
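For illustration only, one way such a repeat could be written (the real implementation is in the attached script; the tensor names here are assumptions):

```python
import torch

data_repeat = 24  # N: the flattened 28*28 image is tiled N times along the feature dimension

images = torch.rand(512, 1, 28, 28)                 # a stand-in batch shaped like Fashion-MNIST
inputs = images.reshape(images.shape[0], 28 ** 2)   # (batch_size, 784)
inputs = inputs.repeat(1, data_repeat)              # (batch_size, 784 * N) -> MLP input dim is 784 * N
```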

```
data_repeat: int = 24
world_size 4  rank_id 2
world_size 4  rank_id 1
world_size 4  rank_id 0
world_size 4  rank_id 3
Train Loop:
repeat_times:  3  obj: 0.035  GPU_memo 15
repeat_times:  7  obj: 0.026  GPU_memo 15
repeat_times: 11  obj: 0.022  GPU_memo 15
repeat_times: 15  obj: 0.020  GPU_memo 15
repeat_times: 19  obj: 0.018  GPU_memo 15
repeat_times: 23  obj: 0.016  GPU_memo 15
repeat_times: 27  obj: 0.014  GPU_memo 15
repeat_times: 31  obj: 0.014  GPU_memo 15
TimeUsed 1 18
TimeUsed 1 18
TimeUsed 1 18
TimeUsed 1 18
```

```
data_repeat: int = 64
world_size 4  rank_id 2
world_size 4  rank_id 3
world_size 4  rank_id 1
world_size 4  rank_id 0
Train Loop:
repeat_times:  3  obj: 0.049  GPUMemo 40
repeat_times:  7  obj: 0.058  GPUMemo 40
repeat_times: 11  obj: 0.050  GPUMemo 40
repeat_times: 15  obj: 0.037  GPUMemo 40
repeat_times: 19  obj: 0.033  GPUMemo 40
repeat_times: 23  obj: 0.030  GPUMemo 40
repeat_times: 27  obj: 0.029  GPUMemo 40
repeat_times: 31  obj: 0.028  GPUMemo 40
TimeUsed 2  MaxGPUMemo 47
TimeUsed 2  MaxGPUMemo 47
TimeUsed 2  MaxGPUMemo 47
TimeUsed 2  MaxGPUMemo 47
```

Yonv1943 commented 1 year ago

Code

Here is the code (rename the attached .py.txt file to .py):

The kernel code:

```python
import os
from typing import Tuple

import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel


def dist_init(rank_id: int, world_size: int):
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # 'localhost'
    os.environ['MASTER_PORT'] = '10086'
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank_id, world_size=world_size)
    # backend in {'nccl', 'gloo', 'mpi'}


def dist_close():
    dist.destroy_process_group()


def run__train_in_fashion_mnist(rank_id: int, world_size: int):
    dist_init(rank_id=rank_id, world_size=world_size)  # backend in {'nccl', 'gloo', 'mpi'}
    ...
    if if_dist_data_parallel:
        model = DistributedDataParallel(model, device_ids=[rank_id])
    ...
    dist_close()


def start_processes(gpu_ids: Tuple[int, ...]):
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_ids)[1:-1]  # e.g. (4, 5, 6, 7) -> "4, 5, 6, 7"
    world_size = len(gpu_ids)

    mp.set_start_method(method='spawn' if os.name == 'nt' else 'forkserver', force=True)
    processes = [mp.Process(target=run__train_in_fashion_mnist, args=(rank_id, world_size), daemon=True)
                 for rank_id in range(world_size)]
    [process.start() for process in processes]
    [process.join() for process in processes]


if __name__ == '__main__':
    GPU_IDs = (4, 5, 6, 7)

    # mp.spawn(fn=run__train_in_fashion_mnist, args=(len(GPU_IDs),), nprocs=len(GPU_IDs), join=True)
    start_processes(gpu_ids=GPU_IDs)
```
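A note on the design: because `start_processes` sets `MASTER_ADDR` / `MASTER_PORT` itself and spawns one process per GPU with `mp.Process`, this variant is started with a plain `python` invocation rather than through `python -m torch.distributed.run`. The `spawn` / `forkserver` start method is forced, presumably because CUDA cannot be re-initialized in child processes created with the default `fork`.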