aau-cns / poet

PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation

Distributed Training in Docker #9

Closed 3bsamad closed 1 year ago

3bsamad commented 1 year ago

First of all, thanks for the great work!

I am using the provided Docker image and am currently trying to run distributed training, since training on only one GPU is slow. I have three GTX 1080s (IDs 0, 1, 2). I added the following args to get_args_parser() in main.py, since I couldn't find args.distributed:

# * Distributed training parameters
parser.add_argument('--distributed', action='store_true', default=False, help='Use multi-processing distributed training to launch ')
parser.add_argument('--world_size', default=3, type=int, help='number of distributed processes/ GPUs to use')
parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
parser.add_argument('--dist_backend', default='nccl', type=str, help='distributed backend') 
parser.add_argument('--local_rank', default=0, type=int, help='rank of the process')     
parser.add_argument('--gpu', default=0, type=int, help='rank of the process') 

Then, in util/misc.py, in init_distributed_mode(args) I added the following:

if 'LOCAL_RANK' not in os.environ:
    os.environ['LOCAL_RANK'] = str(args.local_rank)
if 'RANK' not in os.environ:
    os.environ['RANK'] = str(args.gpu)
if 'WORLD_SIZE' not in os.environ:
    os.environ['WORLD_SIZE'] = str(args.world_size)
if 'MASTER_ADDR' not in os.environ:
    os.environ['MASTER_ADDR'] = '192.168.179.13'
if 'MASTER_PORT' not in os.environ:
    os.environ['MASTER_PORT'] = '8888'

Everything works fine when I start training up until this point:

torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                         world_size=args.world_size, rank=args.gpu)

Here it gets stuck, printing | distributed init (rank 0): env:// in the terminal until I kill the process.
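For context, my current guess as to why it hangs (an assumption on my part, not something I have verified in PoET): with init_method='env://' and world_size=3, init_process_group blocks until all three ranks have joined the rendezvous, so a single python main.py process leaves rank 0 waiting for ranks 1 and 2 forever. Launchers such as torchrun (or python -m torch.distributed.launch --use_env) start one process per GPU and set the environment variables themselves. A minimal sketch of that pattern, not taken from the PoET code:

# Hypothetical launch: torchrun --nproc_per_node=3 main.py ...
# torchrun starts three processes and sets RANK, LOCAL_RANK, WORLD_SIZE,
# MASTER_ADDR and MASTER_PORT in each one's environment.
import os

import torch
import torch.distributed as dist

def init_from_env():
    # Illustrative helper, not PoET's util/misc.py.
    rank = int(os.environ["RANK"])              # 0, 1 or 2, unique per process
    world_size = int(os.environ["WORLD_SIZE"])  # 3
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    # Blocks until all world_size processes have called it.
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=world_size, rank=rank)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank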

I have tried distributed training in Docker before, using this simple example script:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(784, 5000)
        self.fc2 = nn.Linear(5000, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        x = self.fc2(x)
        return x

def train(rank, num_gpus):
    # Backend: "nccl" for GPU training, "gloo" for CPU.
    dist.init_process_group(
        backend="nccl", init_method="env://", world_size=num_gpus, rank=rank
    )
    torch.cuda.set_device(rank)

    model = SimpleNet().to(rank)
    ddp_model = DistributedDataParallel(model, device_ids=[rank])
    print("Rank ", rank, ", Model Created")
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
    )
    train_set = datasets.MNIST("./data", download=True, train=True, transform=transform)
    train_sampler = DistributedSampler(
        dataset=train_set, num_replicas=num_gpus, rank=rank
    )
    train_loader = DataLoader(
        dataset=train_set,
        batch_size=4,
        shuffle=False,
        num_workers=0,
        pin_memory=False,
        sampler=train_sampler,
    )

    criterion = nn.CrossEntropyLoss().to(rank)
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(1000):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs = inputs.to(rank)
            labels = labels.to(rank)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        print("Rank ", rank, ", Epoch ", epoch, ", Loss: ", running_loss)

def main():
    num_gpus = 3
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "16855"
    mp.spawn(train, args=(num_gpus,), nprocs=num_gpus, join=True)

if __name__ == "__main__":
    main()
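As an aside, when I run scripts like this inside a container I usually add a small guard before spawning, so it fails fast if the GPUs are not actually visible. This is purely illustrative and not part of the script above:

# Purely illustrative guard, not part of the original script.
import torch

num_gpus = 3
visible = torch.cuda.device_count()
assert visible >= num_gpus, (
    f"Only {visible} GPU(s) visible inside the container; start it with access "
    "to all GPUs (e.g. docker run --gpus all) and enough shared memory for "
    "NCCL / DataLoader workers (--ipc=host or a larger --shm-size)."
)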

I am a bit new to implementing distributed training, and was wondering what might be wrong/missing here. Any help would be appreciated!

tgjantos commented 1 year ago

Dear @3bsamad,

thank you for trying out PoET! Can I ask you how big your images are and how many images are in your dataset? I am just curious as to why the training takes so long on one GPU. Unfortunately, I am not an expert on distributed training myself and I cannot help you off the top of my head. However, I will look into this issue and try to educate myself more on this topic.

In the meantime, I kindly refer you to the Deformable-DETR and DETR repositories. PoET essentially builds on top of these two repositories, and they provide some more information and scripts for distributed training. Maybe there are issues similar to yours in those repositories. I am pretty sure that if you can get either of these two to run in distributed mode, it should work for PoET as well.

Please keep me updated if you make any progress and do not hesitate to issue a pull request, once you have a solution that provides a fix. I will come back to you once I have time to look into this topic in more detail.

Best, Thomas

3bsamad commented 1 year ago

I am training on a small subset of my dataset, only about 3900 train and 1300 test images of size 800x400. It doesn't take "so long"; one epoch on this subset is about 6-9 minutes. But I wanted to make use of my other two GPUs, since training on my full dataset would be very slow. Thanks for your help, I will look more into this and try to find a solution.

3bsamad commented 1 year ago

@tgjantos I have a preliminary working solution based on the DETR repo; I integrated it into this repo and can now train on my 3 GPUs. The only problem is CUDA running out of memory, but this probably has to do with the model itself or the data size. I'm looking into it right now; I can make a pull request if you want :)
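For anyone hitting the same out-of-memory error: my understanding (a general DistributedDataParallel property, not something specific to PoET) is that the batch size each dataloader uses is per process, so per-GPU memory is governed by that value while the effective batch size scales with the number of GPUs. A tiny illustration with made-up numbers:

# Illustrative arithmetic only; the numbers are hypothetical.
per_gpu_batch = 8                              # samples each of the 3 processes loads per step
world_size = 3                                 # number of GPUs / processes
effective_batch = per_gpu_batch * world_size   # 24 samples per optimizer step
# Lowering per_gpu_batch (or the image resolution) reduces memory on every GPU;
# adding more GPUs does not by itself reduce per-GPU memory use.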

Update: everything works smoothly now; it was only a small error.

tgjantos commented 1 year ago

@3bsamad Sounds awesome! Definitely make a pull request; I would be happy to integrate it into the repo!

Best, Thomas

tgjantos commented 1 year ago

Closed with #11