Yonv1943 opened this issue 1 year ago
Here is the code (change the attached `.txt` files to `.py`):
### DataParallel: `nn.DataParallel`

```python
gpu_ids = (4, 5, 6, 7)
gpu_id = gpu_ids[0]
device = torch.device(f"cuda:{gpu_id}" if (torch.cuda.is_available() and (gpu_id >= 0)) else "cpu")
...
model = nn.DataParallel(model, device_ids=gpu_ids)
...
data = data.to(device)
target = target.to(device)
```
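For context, here is a minimal runnable sketch of the `nn.DataParallel` pattern above; the small MLP, the random data, and the assumption that GPUs 4-7 exist are placeholders of mine, not the repo's code.

```python
import torch
import torch.nn as nn

gpu_ids = (4, 5, 6, 7)                   # assumes a machine with at least 8 GPUs
gpu_id = gpu_ids[0]
device = torch.device(f"cuda:{gpu_id}" if (torch.cuda.is_available() and (gpu_id >= 0)) else "cpu")

# Placeholder model: any nn.Module is wrapped the same way.
model = nn.Sequential(nn.Linear(28 ** 2, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
model = nn.DataParallel(model, device_ids=gpu_ids)   # replicate onto all listed GPUs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

data = torch.rand(512, 28 ** 2)          # fake batch standing in for a real DataLoader
target = torch.randint(0, 10, (512,))

data = data.to(device)                   # inputs go to the first listed device;
target = target.to(device)               # DataParallel scatters them across gpu_ids
loss = criterion(model(data), target)    # outputs are gathered back onto gpu_ids[0]
optimizer.zero_grad()
loss.backward()
optimizer.step()
```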
### DistributedDataParallel: `nn.parallel.DistributedDataParallel`

```python
world_size = dist.get_world_size()  # num_global_GPUs = num_machines * num_GPU_per_machine
rank_id = dist.get_rank()           # the global rank (GPU id) among all GPUs
device = torch.device(f"cuda:{rank_id}" if (torch.cuda.is_available() and (rank_id >= 0)) else "cpu")
...
model = DistributedDataParallel(model, device_ids=[rank_id])
...
dist_inp = data[rank_id::num_gpus].to(device)
dist_lab = target[rank_id::num_gpus].to(device)
...

if __name__ == '__main__':
    dist.init_process_group(backend='nccl', init_method='env://')  # backend in {'nccl', 'gloo', 'mpi'}
    run()
    dist.destroy_process_group()
```
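The `data[rank_id::num_gpus]` slicing above shards each batch by rank manually. For comparison, the stock PyTorch way (not what this demo does) is to let `DistributedSampler` do the sharding at the DataLoader level; a minimal sketch, using a stand-in `TensorDataset`:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Stand-in dataset; in this demo it would be Fashion-MNIST.
dataset = TensorDataset(torch.rand(1024, 28 ** 2), torch.randint(0, 10, (1024,)))

# Requires an already-initialized process group (dist.init_process_group above).
sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
loader = DataLoader(dataset, batch_size=64, sampler=sampler)  # do not pass shuffle=True with a sampler

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
    for data, target in loader:
        ...  # data/target already contain only this rank's shard; train as usual
```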
From https://pytorch.org/docs/stable/distributed.html:

> **Backends**: torch.distributed supports three built-in backends, each with different capabilities. The table below shows which functions are available for use with CPU / CUDA tensors. MPI supports CUDA only if the implementation used to build PyTorch supports it.
`DataParallel` can be run directly. `DistributedDataParallel` needs a special launch method:
https://pytorch.org/docs/stable/distributed.html#launch-utility

```sh
CUDA_VISIBLE_DEVICES="4,5,6,7" OMP_NUM_THREADS=4 \
python -m torch.distributed.run --nproc_per_node 4 \
    DEMO_torch_nn_DistDataParallel.py
```
### Why we use `torch.distributed.run` instead of `torch.distributed.launch`
```
FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.       <------------------------- look at here
Note that --use_env is set by default in torch.distributed.run.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
```
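In practice this means a script launched by `torch.distributed.run` reads its local rank from the environment rather than from an `--local_rank` argument. A minimal sketch of the switch:

```python
import os
import torch
import torch.distributed as dist

# Old style (torch.distributed.launch): parser.add_argument('--local_rank', type=int)
# New style (torch.distributed.run): the launcher exports LOCAL_RANK for every worker.
local_rank = int(os.environ['LOCAL_RANK'])

dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")
```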
### It is possible to run DistDataParallel with `mp.spawn` from PyTorch instead of `python -m torch.distributed.run`

Why is using `mp.spawn` slower than `torch.distributed.launch` for multi-GPU training?
https://github.com/pytorch/pytorch/issues/47587
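For reference, a minimal sketch of the `mp.spawn` launch (the same idea as the commented-out line in the kernel code further down); note that `mp.spawn` passes the process index as the first argument to the worker. The worker body here is an illustration, not the repo's exact code.

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank_id: int, world_size: int):
    # mp.spawn injects rank_id as the first positional argument
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '10086'
    dist.init_process_group(backend='nccl', init_method='env://',
                            rank=rank_id, world_size=world_size)
    ...  # build the model, wrap it in DistributedDataParallel, train
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 4
    mp.spawn(fn=worker, args=(world_size,), nprocs=world_size, join=True)
```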
## Output
### The running output of DataParallel:
The columns are `if_full_model`, `if_data_parallel`, `if_save_data_in_gpu`, then the time used and the GPU memory usage of the main thread and of the sub threads.

**Experiment: the impact of DataParallel on parallel GPU cards, with or without DataParallel on.**

| if_full_model | if_data_parallel | if_save_data_in_gpu | TimeUsed | GPU_max_memo (main) | GPU_max_memo (sub) |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 68 | 0 |
| 0 | 1 | 1 | 12 | 68 | 3 |

**Conclusion: not putting the data in GPU memory does reduce the memory footprint, but increases the training time. Experiment: not putting the data in GPU memory.**

| if_full_model | if_data_parallel | if_save_data_in_gpu | TimeUsed | GPU_max_memo (main) | GPU_max_memo (sub) |
|---|---|---|---|---|---|
| 0 | 1 | 0 | ERROR: putting data on the CPU without turning on full_model reports an error | - | - |
| 1 | 1 | 0 | 19 | 4 | 3 |
| 1 | 0 | 0 | 13 | 13 | 0 |

**Question: full_model has to be used, so does it affect training? Conclusion: whether full_model is turned on or not, the impact on training time and memory usage is too small to see. Experiment: the effect of turning full_model on or off on a single card.**

| if_full_model | if_data_parallel | if_save_data_in_gpu | TimeUsed | GPU_max_memo (main) | GPU_max_memo (sub) |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 68 | 0 |
| 1 | 0 | 1 | 1 | 68 | 0 |

**Experiment: the effect of turning full_model on or off with multiple cards.**

| if_full_model | if_data_parallel | if_save_data_in_gpu | TimeUsed | GPU_max_memo (main) | GPU_max_memo (sub) |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 12 | 68 | 3 |
| 1 | 1 | 1 | 12 | 68 | 3 |
### The running output of DistributedDataParallel:
I use the Fashion-MNIST data in this demo.
In order to simulate training with large data and a large neural network, I use `data_repeat` to enlarge the MNIST data.
`data_repeat=N` means repeating the MNIST data from `28**2` to `28**2 * N` features. This operation enlarges both the input data and the MLP network (the input dimension becomes `28**2 * N`).
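A sketch of what the `data_repeat` trick amounts to, as I read the description above (the tiling via `Tensor.repeat` and the layer sizes are my illustration, not the repo's exact code):

```python
import torch
import torch.nn as nn

data_repeat = 24
x = torch.rand(64, 28 ** 2)                 # a batch of flattened Fashion-MNIST images
x_big = x.repeat(1, data_repeat)            # shape (64, 28**2 * data_repeat)

mlp = nn.Sequential(
    nn.Linear(28 ** 2 * data_repeat, 512),  # the input dimension grows with data_repeat
    nn.ReLU(),
    nn.Linear(512, 10),
)
out = mlp(x_big)                            # (64, 10)
```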
`data_repeat: int = 24`

```
world_size 4  rank_id 2
world_size 4  rank_id 1
world_size 4  rank_id 0
world_size 4  rank_id 3
Train Loop:
repeat_times:  3    obj: 0.035    GPU_memo 15
repeat_times:  7    obj: 0.026    GPU_memo 15
repeat_times: 11    obj: 0.022    GPU_memo 15
repeat_times: 15    obj: 0.020    GPU_memo 15
repeat_times: 19    obj: 0.018    GPU_memo 15
repeat_times: 23    obj: 0.016    GPU_memo 15
repeat_times: 27    obj: 0.014    GPU_memo 15
repeat_times: 31    obj: 0.014    GPU_memo 15
TimeUsed 1  18
TimeUsed 1  18
TimeUsed 1  18
TimeUsed 1  18
```

`data_repeat: int = 64`

```
world_size 4  rank_id 2
world_size 4  rank_id 3
world_size 4  rank_id 1
world_size 4  rank_id 0
Train Loop:
repeat_times:  3    obj: 0.049    GPUMemo 40
repeat_times:  7    obj: 0.058    GPUMemo 40
repeat_times: 11    obj: 0.050    GPUMemo 40
repeat_times: 15    obj: 0.037    GPUMemo 40
repeat_times: 19    obj: 0.033    GPUMemo 40
repeat_times: 23    obj: 0.030    GPUMemo 40
repeat_times: 27    obj: 0.029    GPUMemo 40
repeat_times: 31    obj: 0.028    GPUMemo 40
TimeUsed 2  MaxGPUMemo 47
TimeUsed 2  MaxGPUMemo 47
TimeUsed 2  MaxGPUMemo 47
TimeUsed 2  MaxGPUMemo 47
```
Here is the code (change `.py.txt` to `.py`):

- `DEMO_DistDataParallel.py.txt` (launched with `python -m torch.distributed.run`)
- `DEMO_DistDataParallel_mp.py.txt` (launched directly; the `GPU_IDs` are set inside the script)

Kernel code:
```python
import os
from typing import Tuple

import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel


def dist_init(rank_id: int, world_size: int):
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # 'localhost'
    os.environ['MASTER_PORT'] = '10086'
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank_id, world_size=world_size)
    # backend in {'nccl', 'gloo', 'mpi'}


def dist_close():
    dist.destroy_process_group()


def run__train_in_fashion_mnist(rank_id: int, world_size: int):
    dist_init(rank_id=rank_id, world_size=world_size)  # backend in {'nccl', 'gloo', 'mpi'}
    ...
    if if_dist_data_parallel:
        model = DistributedDataParallel(model, device_ids=[rank_id])
    ...
    dist_close()


def start_processes(gpu_ids: Tuple[int, ...]):
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_ids)[1:-1]
    world_size = len(gpu_ids)

    mp.set_start_method(method='spawn' if os.name == 'nt' else 'forkserver', force=True)
    processes = [mp.Process(target=run__train_in_fashion_mnist, args=(rank_id, world_size), daemon=True)
                 for rank_id in range(0, 0 + world_size)]
    [process.start() for process in processes]
    [process.join() for process in processes]


if __name__ == '__main__':
    GPU_IDs = (4, 5, 6, 7)
    # mp.spawn(fn=run__train_in_fashion_mnist, args=(NumGPUs,), nprocs=len(GPU_IDs), join=True)
    start_processes(gpu_ids=GPU_IDs)
```
- **DataParallel**: multi-thread, for a single machine with multiple GPUs (use FullModel, which writes the loss function into the model, to solve the memory-usage imbalance problem). It is very easy to add DataParallel to the code, but DataParallel brings less speedup.
- **DistributedDataParallel**: multi-process, for single or multiple machines with multiple GPUs (FullModel can be used here as well). It is a little tricky to use because DistributedDataParallel needs to be started from the command line, but it gives a significant speedup with 4 GPUs on a single machine under heavy GPU memory usage.
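A minimal sketch of the FullModel idea mentioned above: compute the loss inside `forward()`, so that under `nn.DataParallel` each replica returns only a per-GPU loss and the large output tensors are never gathered onto GPU 0. The class name and structure below are my illustration, not the repo's exact code.

```python
import torch
import torch.nn as nn

class FullModel(nn.Module):
    """Wrap a model and its criterion so the loss is computed inside forward()."""
    def __init__(self, model: nn.Module, criterion: nn.Module):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inp, target):
        out = self.model(inp)
        # Returning the loss means DataParallel gathers one scalar per replica
        # instead of the full output tensor, which evens out GPU memory usage.
        return self.criterion(out, target)

# Usage with DataParallel (gpu_ids, model, data, target as in the snippets above):
# full_model = nn.DataParallel(FullModel(model, nn.CrossEntropyLoss()), device_ids=gpu_ids)
# loss = full_model(data, target).mean()  # mean over the per-GPU losses
```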