After several epochs all losses and metrics go to 0 while training YOLO NAS on a custom dataset

angelinager commented 1 year ago

💡 Your Question

Hi,

I'm trying to train YOLO NAS on a custom dataset that combines two different datasets: the first has 17 classes and the second adds 12 new classes. My task is to take the whole first dataset and add 10% of the second dataset to it (the second one contains new classes but has far more images, so I only want a random portion of it), then treat the result as a single input dataset for the model. I wrote two custom Dataset classes and already posted question #1056 about the varying number of bboxes; your suggestion on padding helped a lot. Thanks!

And now I have my datasets (I sampled 10% of the second dataset using torch.utils.data.Subset()), which I later combine in one dataset using torch.utils.data.ConcatDataset(), but there is another unexpected issue: the model starts training just okay but after several epochs I see that practically all losses and metrics drop to zero.
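The 10% sampling step described above can be sketched without any framework. This is a minimal illustration, assuming a hypothetical helper named `sample_indices` and an arbitrary seed; the resulting index list is the kind of argument that torch.utils.data.Subset() takes. Seeding a local random generator makes the selected subset identical across runs, which helps when comparing experiments.

```python
import random
from typing import List


def sample_indices(dataset_len: int, fraction: float, seed: int = 0) -> List[int]:
    """Pick a reproducible random subset of indices (hypothetical helper)."""
    rng = random.Random(seed)        # local RNG, leaves the global random state untouched
    k = int(dataset_len * fraction)  # number of samples to keep, e.g. 10% of the dataset
    return sorted(rng.sample(range(dataset_len), k))


# Illustrative dataset size; in the real script this would be len(dataset).
indices = sample_indices(dataset_len=1000, fraction=0.1, seed=42)
print(len(indices), "indices selected")
```

Calling the helper again with the same seed returns the same indices, so the train and validation subsets stay fixed between runs.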

Also, I am using pre-trained COCO weights. I've seen losses go to NaN when the learning rate is too high, but I've never seen all the metrics go to zero. I tried decreasing the learning rate (my initial lr was 5e-4; I tried 5e-5 and 5e-6), removed the warmup epochs, and also tried training the model from scratch without the COCO weights. The model eventually goes to zeros in all of those cases.

Now I'm starting to suspect that I did something wrong in either Dataset() or DataLoader(), but I just can't work out what it is. Your help or any useful debugging suggestions would be much appreciated.

Versions

No response

angelinager commented 1 year ago

In case you need my code for Dataset classes along with the helper function for padding (My datasets are in YOLO format):

import os

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset


def pad_targets(targets: np.ndarray, max_targets: int) -> np.ndarray:
    # Zero-pad (or truncate) the target rows to exactly max_targets rows.
    padded_targets = np.zeros((max_targets, targets.shape[-1]), dtype=np.float32)
    n = min(len(targets), max_targets)
    padded_targets[:n] = targets[:n]
    return padded_targets

class KedenCustomDataset(Dataset):

    def __init__(self, data_folder):
        self.image_folder = os.path.join(data_folder, "images")
        self.annot_folder = os.path.join(data_folder, "labels")
        # Read data files
        self.images = os.listdir(self.image_folder)
        # Read annotation files
        self.annotations = os.listdir(self.annot_folder)

    def __len__(self):
        return min(len(self.images), len(self.annotations))

    def __getitem__(self, i):
        resize_dim = 640
        pad_dim = 6

        # Read image and label
        image_filename = self.images[i]
        annot_filename = os.path.splitext(image_filename)[0] + '.txt'

        image_path = os.path.join(self.image_folder, image_filename)
        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2BGR)
        image = cv2.resize(image, (resize_dim, resize_dim))
        image_tensor = torch.tensor(np.array(image, dtype=np.float32)).permute(2, 0, 1).float()

        if annot_filename in self.annotations:

            labels_path = os.path.join(self.annot_folder, annot_filename)
            labels = np.loadtxt(labels_path, delimiter=' ')

            if labels.ndim == 1:
                labels = np.reshape(labels, (1, -1))

            labels = pad_targets(labels, pad_dim)
        else:
            # No annotation file: all-zero targets, same dtype as pad_targets() output
            labels = np.zeros((pad_dim, 5), dtype=np.float32)

        return image_tensor, labels

My images are grayscale, so I read them as grayscale and then convert them to three channels. Also, a couple of images have no annotation file, so for those I return an all-zero label matrix (I don't know whether that's the right way to handle such images, though).
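One way to make the no-annotation case explicit is to have the label loader always return an array of shape (N, 5), with N = 0 for a missing or empty file. The sketch below is only an illustration under that assumption; `load_yolo_labels` is a hypothetical helper, not part of the code in this thread, and it parses the file by hand so that empty files and malformed rows are handled deliberately.

```python
import os
import tempfile

import numpy as np


def load_yolo_labels(path: str) -> np.ndarray:
    """Return an (N, 5) float32 array of [class, cx, cy, w, h]; N may be 0."""
    rows = []
    if os.path.isfile(path):
        with open(path) as f:
            for line_no, line in enumerate(f, start=1):
                parts = line.split()
                if not parts:
                    continue  # skip blank lines
                if len(parts) != 5:
                    raise ValueError(f"{path}:{line_no}: expected 5 fields, got {len(parts)}")
                rows.append([float(p) for p in parts])
    # reshape(-1, 5) also turns an empty list into a (0, 5) array
    return np.asarray(rows, dtype=np.float32).reshape(-1, 5)


# Small demonstration on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    labelled_path = os.path.join(tmp, "labelled.txt")
    with open(labelled_path, "w") as f:
        f.write("3 0.5 0.5 0.2 0.1\n\n1 0.25 0.75 0.1 0.1\n")
    empty_path = os.path.join(tmp, "empty.txt")
    open(empty_path, "w").close()

    labelled = load_yolo_labels(labelled_path)
    empty = load_yolo_labels(empty_path)
    missing = load_yolo_labels(os.path.join(tmp, "missing.txt"))

print(labelled.shape, empty.shape, missing.shape)
```

An array with zero rows can still be passed to a padding helper such as pad_targets(), which then yields the same all-zero matrix as before, but through a single code path.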

The Dataset class for the second dataset is very similar to the first one, except that the image files are in ".png" format instead of ".tif". In the constructor I also pass the number of classes from the first dataset, which I use to shift the labels of the second dataset, since labels start from 0 in both datasets.
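The label shift described above can be illustrated in isolation. This is a minimal sketch, assuming YOLO-style rows of [class, cx, cy, w, h]; `shift_class_ids` is a hypothetical name, and the counts 17 and 12 are the class counts mentioned in the question. The shift is applied before any zero padding, so padded rows keep class 0 and all-zero boxes.

```python
import numpy as np


def shift_class_ids(labels: np.ndarray, offset: int) -> np.ndarray:
    """Return a copy of (N, 5) labels with the class column shifted by offset."""
    shifted = labels.astype(np.float32, copy=True)  # copy so the input stays unchanged
    shifted[:, 0] += offset
    return shifted


# Two boxes from the second dataset, with its lowest (0) and highest (11) class ids.
second = np.array(
    [[0, 0.5, 0.5, 0.2, 0.2],
     [11, 0.3, 0.3, 0.1, 0.1]],
    dtype=np.float32,
)
merged = shift_class_ids(second, offset=17)

# After the shift, every class id must still be below the combined class count.
assert merged[:, 0].max() < 17 + 12
print(merged[:, 0])
```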

class PidrayCustomDataset(Dataset):

    def __init__(self, data_folder, keden_labels_num):

        self.keden_labels_num = keden_labels_num
        self.image_folder = os.path.join(data_folder, "images")
        self.annot_folder = os.path.join(data_folder, "labels")

        # Read data files
        self.images = os.listdir(self.image_folder)
        # Read annotation files
        self.annotations = os.listdir(self.annot_folder)     

    def __len__(self):
        return min(len(self.images), len(self.annotations))

    def __getitem__(self, i):

        resize_dim = 640
        pad_dim = 6

        # Read image and label
        image_filename = self.images[i]
        annot_filename = os.path.splitext(image_filename)[0] + '.txt'

        image_path = os.path.join(self.image_folder, image_filename)

        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2BGR)
        image = cv2.resize(image, (resize_dim, resize_dim))
        # image = Image.open(image_path, mode='r').resize((320, 320))
        image_tensor = torch.tensor(np.array(image, dtype=np.float32)).permute(2, 0, 1).float()

        if annot_filename in self.annotations:

            labels_path = os.path.join(self.annot_folder, annot_filename)
            labels = np.loadtxt(labels_path, delimiter=' ')

            if labels.ndim == 1:
                labels = np.reshape(labels, (1, -1))

            labels[:, 0] += self.keden_labels_num

            labels = pad_targets(labels, pad_dim)

        else:
            # image_tensor = torch.zeros(3, resize_dim, resize_dim)
            labels = np.zeros((pad_dim, 5), dtype=np.float32)

        return image_tensor, labels

angelinager commented 1 year ago

And the training script:

import torch
import random
from torch.utils.data import Dataset, DataLoader

from super_gradients.training import models
from super_gradients.training import Trainer
from super_gradients.training.utils.detection_utils import DetectionCollateFN
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback

import my_custom_datasets 

BATCH_SIZE = 16

CHECKPOINT_DIR = 'checkpoints'
exper_name = '...'

trainer = Trainer(experiment_name=exper_name, ckpt_root_dir=CHECKPOINT_DIR)

keden_labels = [...17 classes...]
pidray_labels =  [...12 classes...]
classes = keden_labels + pidray_labels 

keden_train_dir = ...
keden_val_dir = ...

pidray_train_dir = ...
pidray_val_dir = ...

train_params = {
    "max_epochs": 50,
    # silent_mode=False keeps the training printouts enabled
    "silent_mode": False,
    "average_best_models": True,
    "warmup_mode": "linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 0,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "Adam",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        # NOTE: num_classes needs to be defined here
        num_classes=len(classes),
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            # NOTE: num_classes needs to be defined here
            num_cls=len(classes),
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50',

    "sg_logger": "clearml_sg_logger", # ClearML Logger, see class ClearMLSGLogger for details
    "sg_logger_params": # parameters that will be passed to __init__ of the logger 
      {
        "project_name": exper_name, # ClearML project name
        "save_checkpoints_remote": True,
        "save_tensorboard_remote": True,
        "save_logs_remote": True,
      } 
}

# Prepare Datasets
keden_train_dataset = my_custom_datasets.KedenCustomDataset(keden_train_dir)
keden_val_dataset = my_custom_datasets.KedenCustomDataset(keden_val_dir)

pidray_train_dataset = my_custom_datasets.PidrayCustomDataset(pidray_train_dir, len(keden_labels))
pidray_val_dataset = my_custom_datasets.PidrayCustomDataset(pidray_val_dir, len(keden_labels))

# Randomly select 10% of indices from Pidray dataset
percent_pidray = 0.1
pidray_train_rand_selected = random.sample(range(len(pidray_train_dataset)), int(len(pidray_train_dataset) * percent_pidray))
pidray_val_rand_selected = random.sample(range(len(pidray_val_dataset)), int(len(pidray_val_dataset) * percent_pidray))

#  Create Pidray subsets for train and validation
pidray_train_subset = torch.utils.data.Subset(pidray_train_dataset, pidray_train_rand_selected)
pidray_val_subset = torch.utils.data.Subset(pidray_val_dataset, pidray_val_rand_selected)

# Concatenate the whole Keden dataset and 10% of a Pidray dataset
concat_train_dataset = torch.utils.data.ConcatDataset([keden_train_dataset, pidray_train_subset])
concat_val_dataset = torch.utils.data.ConcatDataset([keden_val_dataset, pidray_val_subset])

train_dataloader = DataLoader(concat_train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, collate_fn=DetectionCollateFN()) 
val_dataloader = DataLoader(concat_val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, collate_fn=DetectionCollateFN())

model = models.get('yolo_nas_l', 
                   num_classes=len(classes),
                   pretrained_weights="coco")

trainer.train(model=model, 
              training_params=train_params, 
              train_loader=train_dataloader, 
              valid_loader=val_dataloader)

wematan commented 1 year ago

@angelinager I'm having an issue where the YOLO-NAS training loss goes to NaN; it can happen after 1-2 epochs. I traced the problem by debugging a bit, and it resides in the mixed-precision part of the training loop. When mixed precision is disabled, the loss doesn't go to NaN. On the other hand, disabling it is not feasible if you're dealing with very large datasets (as I am).
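For a short diagnostic run, mixed precision can be switched off through the same "mixed_precision" key that appears in the train_params dict of the training script earlier in this thread. The snippet below is only a sketch: the one-key dict is a stand-in for the full parameters, and `fp32_params` is a hypothetical name.

```python
# Stand-in for the train_params dict from the training script;
# only the key relevant here is shown.
train_params = {"mixed_precision": True}

# Copy the parameters and disable mixed precision for a diagnostic run,
# leaving the original dict untouched.
fp32_params = {**train_params, "mixed_precision": False}

print(fp32_params)
```

If the losses stay finite with this setting, that points at a numerical issue under reduced precision rather than at the data pipeline.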

BTW, I'm also training YOLO-X and it hasn't happened there yet (epoch 5). If this is a consistent issue with YOLO-NAS, it rather overshadows the competitiveness of YOLO-NAS over other architectures (which might affect SG framework adoption as well).

harpreetsahota204 commented 1 year ago

Hi @angelinager @wematan

Thanks for opening an issue for SG, and sharing your experience here. I'm formally gathering some feedback on SuperGradients and YOLO-NAS.

Would you be down for a quick call to chat about your experience?

If a call doesn't work for you, no worries. I've got a short survey you could fill out: https://bit.ly/sgyn-feedback.

I know you’re super busy, but your input will help us shape the direction of SuperGradients and make it as useful as possible for you.

I appreciate your time and feedback. Let me know what works for you.

Cheers,

Harpreet

BloodAxe commented 1 year ago

Since you are using custom dataset classes, we cannot provide support here (or, putting it straight, debug your code for you). I think there is a problem with the format of the target boxes that your dataset outputs. An IoU loss of zero clearly indicates that the model cannot train a regressor at all. I suggest you start by double-checking the output of the COCODetectionDataset class and matching it against what your dataset actually produces.
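A quick sanity check on the targets coming out of a custom dataset might look like the sketch below. It assumes normalized YOLO-style rows of [class, cx, cy, w, h] with all-zero rows used as padding, as described earlier in the thread; `check_targets` is a hypothetical helper. Whether the loss expects normalized or pixel coordinates is not settled here and should be verified against what the SG dataset classes emit.

```python
import numpy as np


def check_targets(targets: np.ndarray, num_classes: int) -> list:
    """Return human-readable problems found in an (N, 5) target array."""
    if targets.ndim != 2 or targets.shape[1] != 5:
        return [f"expected shape (N, 5), got {targets.shape}"]
    problems = []
    if not np.isfinite(targets).all():
        problems.append("non-finite values present")
    rows = targets[np.any(targets != 0, axis=1)]  # ignore all-zero padding rows
    cls, boxes = rows[:, 0], rows[:, 1:]
    if np.any(cls != np.round(cls)) or np.any(cls < 0) or np.any(cls >= num_classes):
        problems.append("class id outside [0, num_classes) or not an integer")
    if np.any(boxes < 0) or np.any(boxes > 1):
        problems.append("box coordinate outside [0, 1]")
    if np.any(boxes[:, 2:] <= 0):
        problems.append("box with non-positive width or height")
    return problems


# One valid box plus a padding row: no problems expected.
good = np.array([[3, 0.5, 0.5, 0.2, 0.1], [0, 0, 0, 0, 0]], dtype=np.float32)
# Class id equal to num_classes and pixel coordinates: two problems expected.
bad = np.array([[29, 320, 240, 50, 40]], dtype=np.float32)

print(check_targets(good, num_classes=29))
print(check_targets(bad, num_classes=29))
```

Running such a check over a few hundred samples from each dataset, before wrapping them in a DataLoader, narrows down whether the targets themselves are malformed.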

Should you refactor your training code to use SG datasets and provide a Colab example where we can reproduce the training, feel free to re-open the issue.

legenda971 commented 7 months ago

I've been working with the YOLO-NAS-N model and have the same issue. After approximately 1000+ epochs of training, I started observing NaN values in one of the loss functions. Upon closer examination, I discovered that the issue seems to be related to the Batch Normalization (BatchNorm) layers within the model.
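To locate which layers carry NaN or inf values, one option is to scan the saved parameters and buffers by name. The sketch below works on a plain mapping of names to NumPy arrays; `find_non_finite` is a hypothetical helper and the key names in the toy dictionary are illustrative only, not taken from YOLO-NAS.

```python
import numpy as np


def find_non_finite(state: dict) -> list:
    """Return the names of entries that contain NaN or inf values."""
    return [
        name
        for name, value in state.items()
        if not np.isfinite(np.asarray(value, dtype=np.float64)).all()
    ]


# Toy stand-in for a model state dict; the key names are illustrative only.
state = {
    "bn1.running_mean": np.array([0.1, -0.2]),
    "bn1.running_var": np.array([1.0, np.nan]),
    "conv1.weight": np.ones((2, 2)),
}
bad_entries = find_non_finite(state)
print(bad_entries)
```

With a PyTorch model, the mapping could be built from `model.state_dict()` by converting each tensor with `.detach().cpu().numpy()`.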

Deci-AI / super-gradients

After several epochs all losses and metrics go to 0 while training YOLO NAS on a custom dataset #1110