IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

Convergence problem on COCO with fewer GPUs #219

Closed · JarintotionDin closed this issue 1 year ago

JarintotionDin commented 1 year ago

I'm trying to train the DINO model on 2 GPUs with batch_size 4, using ImageNet-pretrained ResNet-50/ResNet-101 backbones, but I can't see convergence after many iterations (about 300,000). Do you have suggestions for training DINO with fewer GPUs?

rentainhe commented 1 year ago

> I'm trying to train the DINO model on 2 GPUs with batch_size 4, using ImageNet-pretrained ResNet-50/ResNet-101 backbones, but I can't see convergence after many iterations (about 300,000). Do you have suggestions for training DINO with fewer GPUs?

Would you like to share your training logs and config with us?

ustcwhy commented 1 year ago

> I'm trying to train the DINO model on 2 GPUs with batch_size 4, using ImageNet-pretrained ResNet-50/ResNet-101 backbones, but I can't see convergence after many iterations (about 300,000). Do you have suggestions for training DINO with fewer GPUs?

Did you resolve this issue? I met the same problem when reproducing DAB-DETR: the model diverges after 50,000-60,000 iterations.

JarintotionDin commented 1 year ago

I solved this problem by increasing the effective batch size. It seems that DETR-based methods need a large enough batch size to ensure convergence. Here is my solution, which accumulates gradients to emulate a larger batch size when training with fewer GPUs:

def run_step(self):
    """
    Implement the standard training logic described above.
    """
    assert self.model.training, "[Trainer] model was changed to eval mode!"
    assert torch.cuda.is_available(), "[Trainer] CUDA is required for AMP training!"
    from torch.cuda.amp import autocast

    start = time.perf_counter()
    """
    If you want to do something with the data, you can wrap the dataloader.
    """
    data = next(self._data_loader_iter)
    data_time = time.perf_counter() - start

    """
    If you want to do something with the losses, you can wrap the model.
    """
    with autocast(enabled=self.amp):
        loss_dict = self.model(data)
        if isinstance(loss_dict, torch.Tensor):
            losses = loss_dict
            loss_dict = {"total_loss": loss_dict}
        else:
            losses = sum(loss_dict.values())

    """
    If you need to accumulate gradients or do something similar, you can
    wrap the optimizer with your custom `zero_grad()` method.
    """
    if self.amp:
        self.grad_scaler.scale(losses).backward()
        if self.clip_grad_params is not None:
            self.grad_scaler.unscale_(self.optimizer)
            self.clip_grads(self.model.parameters())
        # Step the optimizer only every `batch_size_scale` iterations,
        # accumulating gradients in between to emulate a larger batch size.
        if self.iter % self.batch_size_scale == 0:
            self.grad_scaler.step(self.optimizer)
            self.grad_scaler.update()
            self.optimizer.zero_grad()
    else:
        losses.backward()
        if self.clip_grad_params is not None:
            self.clip_grads(self.model.parameters())
        # Same accumulation logic for the non-AMP path.
        if self.iter % self.batch_size_scale == 0:
            self.optimizer.step()
            self.optimizer.zero_grad()

    self._write_metrics(loss_dict, data_time)
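
For a rough sense of what this buys on the setup from this issue (2 GPUs, batch_size 4), here is a small sketch of the resulting effective batch size. Note that `batch_size_scale` is the attribute the modified `run_step` assumes was added to the trainer and its config; it is not part of the stock detrex trainer:

```python
# Back-of-the-envelope effective batch size under the gradient-accumulation
# trick above (numbers match the 2-GPU, batch_size 4 setup in this issue).
num_gpus = 2
images_per_gpu = 2                    # total batch size 4 split across 2 GPUs
batch_size_scale = 4                  # hypothetical trainer attribute used in run_step

per_step_images = num_gpus * images_per_gpu            # images per forward/backward pass
effective_batch = per_step_images * batch_size_scale   # images per optimizer step
print(per_step_images, effective_batch)                # -> 4 16
```
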
ustcwhy commented 1 year ago

> I solved this problem by increasing the effective batch size. It seems that DETR-based methods need a large enough batch size to ensure convergence.

Thanks~ May I ask what total batch size and learning rate you used in your experiments?

rentainhe commented 1 year ago

Sorry for missing this issue before. I was wondering whether the core cause of the convergence problem is the batch size on each GPU~ In detrex, if you directly decrease --num-gpus without changing dataloader.train.total_batch_size, more images will be allocated to each GPU.
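
To make that concrete, a small sketch of how the per-GPU load changes when only --num-gpus is reduced; the 16-image total is just an example value taken from the standard detrex configs:

```python
# dataloader.train.total_batch_size is split evenly across the training GPUs,
# so keeping it fixed while reducing --num-gpus puts more images on each GPU.
total_batch_size = 16                 # example value from the provided configs
for num_gpus in (8, 4, 2):
    per_gpu = total_batch_size // num_gpus
    print(f"{num_gpus} GPUs -> {per_gpu} images per GPU")
# 8 GPUs -> 2 images per GPU
# 4 GPUs -> 4 images per GPU
# 2 GPUs -> 8 images per GPU
```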

rentainhe commented 1 year ago

> I solved this problem by increasing the effective batch size. It seems that DETR-based methods need a large enough batch size to ensure convergence.

I think it's better to train DETR-like models with a larger batch size, e.g. batch-size=8 or batch-size=16. I'm closing this issue; feel free to reopen it if necessary.
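
For reference, a minimal sketch of keeping a larger total batch size in a detrex LazyConfig, assuming the common COCO dataloader config shipped with detrex (the get_config path mirrors the project configs; adjust it to your own setup). Whether 16 images fit on 2 GPUs depends on memory; otherwise the gradient-accumulation approach above is the fallback:

```python
from detrex.config import get_config

# Reuse the common COCO dataloader from the detrex configs and keep the
# effective batch size at 16 even when launching with fewer GPUs.
dataloader = get_config("common/data/coco_detr.py").dataloader
dataloader.train.total_batch_size = 16   # 8 images per GPU when run with --num-gpus 2
```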

rentainhe commented 1 year ago

BTW, we will note this in the pinned issue~ @ustcwhy, and thanks for reporting this issue.