Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

I tried to train on my custom dataset but hit KeyError: 'mAP@0.50' #1636

Closed (AmuzeLu closed 7 months ago)

AmuzeLu commented 11 months ago

💡 Your Question

The error log is attached below.

[2023-11-14 11:19:42] INFO - crash_tips_setup.py - Crash tips is enabled. You can set your environment variable to CRASH_HANDLER=FALSE to disable it
[2023-11-14 11:19:42] WARNING - __init__.py - Failed to import pytorch_quantization
[2023-11-14 11:19:43] WARNING - calibrator.py - Failed to import pytorch_quantization
[2023-11-14 11:19:43] WARNING - export.py - Failed to import pytorch_quantization
[2023-11-14 11:19:43] WARNING - selective_quantization_utils.py - Failed to import pytorch_quantization
[2023-11-14 11:19:43] INFO - detection_dataset.py - Dataset Initialization in progress. `cache_annotations=True` causes the process to take longer due to full dataset indexing.
Indexing dataset annotations: 100%|██████████| 87/87 [00:00<00:00, 17532.53it/s]
[2023-11-14 11:19:43] INFO - detection_dataset.py - Dataset Initialization in progress. `cache_annotations=True` causes the process to take longer due to full dataset indexing.
Indexing dataset annotations: 100%|██████████| 4/4 [00:00<00:00, 10174.18it/s]
[2023-11-14 11:19:43] INFO - checkpoint_utils.py - License Notification: YOLO-NAS pre-trained weights are subjected to the specific license terms and conditions detailed in https://github.com/Deci-AI/super-gradients/blob/master/LICENSE.YOLONAS.md By downloading the pre-trained weight files you agree to comply with these terms.
[2023-11-14 11:19:43] INFO - checkpoint_utils.py - Successfully loaded pretrained weights for architecture yolo_nas_s
[2023-11-14 11:19:44] INFO - sg_trainer.py - Starting a new run with `run_id=RUN_20231114_111944_239051`
[2023-11-14 11:19:44] INFO - sg_trainer.py - Checkpoints directory: checkpoints/ylsff2/RUN_20231114_111944_239051
[2023-11-14 11:19:44] INFO - sg_trainer.py - Using EMA with params {'decay': 0.9, 'decay_type': 'threshold'}
The console stream is now moved to checkpoints/ylsff2/RUN_20231114_111944_239051/console_Nov14_11_19_44.txt
/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/numpy/lib/arraypad.py:487: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  x = np.array(x)
/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/numpy/lib/arraypad.py:487: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  x = np.array(x)
[2023-11-14 11:19:47] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:

[2023-11-14 11:19:47] INFO - sg_trainer.py - Started training for 201 epochs (0/200)

Train epoch 0: 0%| | 0/5 [00:00<?, ?it/s]
/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/numpy/lib/arraypad.py:487: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  x = np.array(x)
/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/numpy/lib/arraypad.py:487: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  x = np.array(x)
Train epoch 0: 100%|██████████| 5/5 [00:04<00:00, 1.19it/s, PPYoloELoss/loss=4, PPYoloELoss/loss_cls=2.11, PPYoloELoss/loss_dfl=0.851, PPYoloELoss/loss_iou=1.04, gpu_mem=5.47]
Validating: | | 0/0 [00:00<?, ?it/s]
/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: The compute method of metric DetectionMetrics_050 was called before the update method which may lead to errors, as metric states have not yet been updated.
  warnings.warn(*args, **kwargs)
[2023-11-14 11:19:51] INFO - base_sg_logger.py - [CLEANUP] - Successfully stopped system monitoring process
[2023-11-14 11:19:51] ERROR - sg_trainer_utils.py - Uncaught exception
Traceback (most recent call last):
  File "/home/lu/workspace/yolonas/test2.py", line 107, in <module>
    trainer.train(
  File "/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 1527, in train
    self._write_to_disk_operations(
  File "/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 1963, in _write_to_disk_operations
    self._save_checkpoint(
  File "/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 661, in _save_checkpoint
    curr_tracked_metric = float(validation_results_dict[self.metric_to_watch])
KeyError: 'mAP@0.50'
Traceback (most recent call last):
  File "/home/lu/workspace/yolonas/test2.py", line 107, in <module>
    trainer.train(
  File "/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 1527, in train
    self._write_to_disk_operations(
  File "/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 1963, in _write_to_disk_operations
    self._save_checkpoint(
  File "/home/lu/anaconda3/envs/sg/lib/python3.8/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 661, in _save_checkpoint
    curr_tracked_metric = float(validation_results_dict[self.metric_to_watch])
KeyError: 'mAP@0.50'
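The traceback ends at a plain dictionary lookup, so the error only says which key was missing, not which keys validation actually produced. A minimal sketch of a more informative lookup, in plain Python; `resolve_watched_metric` and the sample dictionary are hypothetical illustrations and not part of super-gradients:

```python
def resolve_watched_metric(results: dict, metric_to_watch: str) -> float:
    """Return the watched metric as a float, or fail while listing the keys that exist."""
    if metric_to_watch not in results:
        available = ", ".join(sorted(results)) or "<none>"
        raise KeyError(
            f"metric_to_watch={metric_to_watch!r} not found; available keys: {available}"
        )
    return float(results[metric_to_watch])


# Hypothetical results dict, resembling a validation pass that produced no detection metrics.
empty_validation = {"PPYoloELoss/loss_cls": 0.0}

try:
    resolve_watched_metric(empty_validation, "mAP@0.50")
except KeyError as err:
    message = str(err)
    print(message)
```

Printing the available keys this way would show at a glance whether the name is misspelled or whether no detection metric was computed at all.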

And the code is as below.

import os

import requests
import torch
from PIL import Image

from super_gradients.training import Trainer, dataloaders, models
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train, coco_detection_yolo_format_val
)
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback

class config:
    CHECKPOINT_DIR = 'checkpoints'  
    EXPERIMENT_NAME = 'ylsff2'    
    DATA_DIR = '/home/lu/workspace/data/ylsff_all/'   
    TRAIN_IMAGES_DIR = 'train/images'   
    TRAIN_LABELS_DIR = 'train/labels'   

    VAL_IMAGES_DIR = 'val/images'
    VAL_LABELS_DIR = 'val/labels'

    CLASSES = ['ylsff']

    NUM_CLASSES = len(CLASSES)

    DATALOADER_PARAMS = {
        'batch_size': 16,
        'num_workers': 2
    }

    MODEL_NAME = 'yolo_nas_s'
    PRETRAINED_WEIGHTS = 'coco'

trainer = Trainer(experiment_name=config.EXPERIMENT_NAME, ckpt_root_dir=config.CHECKPOINT_DIR)

train_data = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': config.DATA_DIR,
        'images_dir': config.TRAIN_IMAGES_DIR,
        'labels_dir': config.TRAIN_LABELS_DIR,
        'classes': config.CLASSES
    },
    dataloader_params=config.DATALOADER_PARAMS
)

val_data = coco_detection_yolo_format_val(
    dataset_params={
        'data_dir': config.DATA_DIR,
        'images_dir': config.VAL_IMAGES_DIR,
        'labels_dir': config.VAL_LABELS_DIR,
        'classes': config.CLASSES
    },
    dataloader_params=config.DATALOADER_PARAMS
)

dataloader_params = config.DATALOADER_PARAMS

model = models.get(
    config.MODEL_NAME,
    num_classes=config.NUM_CLASSES,
    pretrained_weights=config.PRETRAINED_WEIGHTS,
)

train_params = {
    "average_best_models": True,
    "warmup_mode": "linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    # "lr_mode": "cosine",
    "lr_mode": "CosineLRScheduler",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "Adam",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},

    "max_epochs": 200,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=config.NUM_CLASSES,
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.3,
            top_k_predictions=300,
            num_cls=config.NUM_CLASSES,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.5
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50'
    # 'metric_to_watch': 'PPYoloELoss/loss_cls'
}

trainer.train(
    model = model,
    training_params = train_params,
    train_loader = train_data,
    valid_loader = val_data
)

# device = torch.device("cuda:0")
# best_model = models.get(
#     config.MODEL_NAME, num_classes=config.NUM_CLASSES,
#     checkpoint_path=os.path.join(config.CHECKPOINT_DIR, config.EXPERIMENT_NAME, 'average_model.pth')).to(device)
#
# best_model.predict('/home/lu/workspace/data/ylsff_all/val/images/', conf=0.2).show()

If I comment out valid_metrics_list and change metric_to_watch to PPYoloELoss/loss_cls, training runs, but validation still does not look right. The log is as follows:

SUMMARY OF EPOCH 1
├── Train
│   ├── Ppyoloeloss/loss_cls = 1.8052
│   │   ├── Epoch N-1      = 1.8946 (↘ -0.0894)
│   │   └── Best until now = 1.8946 (↘ -0.0894)
│   ├── Ppyoloeloss/loss_iou = 1.0784
│   │   ├── Epoch N-1      = 1.0313 (↗ 0.0471)
│   │   └── Best until now = 1.0313 (↗ 0.0471)
│   ├── Ppyoloeloss/loss_dfl = 0.9002
│   │   ├── Epoch N-1      = 0.8675 (↗ 0.0327)
│   │   └── Best until now = 0.8675 (↗ 0.0327)
│   └── Ppyoloeloss/loss = 3.7839
│       ├── Epoch N-1      = 3.7934 (↘ -0.0095)
│       └── Best until now = 3.7934 (↘ -0.0095)
└── Validation
    ├── Ppyoloeloss/loss_cls = 0
    │   ├── Epoch N-1      = 0 (= 0)
    │   └── Best until now = 0 (= 0)
    ├── Ppyoloeloss/loss_iou = None
    ├── Ppyoloeloss/loss_dfl = None
    └── Ppyoloeloss/loss = None
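The logs report 87 training annotations, 4 validation annotations, a training progress bar of 5/5, and a validation progress bar of 0/0, all with a batch size of 16. The sketch below works through the batch-count arithmetic that a PyTorch-style dataloader uses. Whether the validation loader here actually runs with drop_last enabled is an assumption inferred from those counts, not something confirmed in this thread; `num_batches` is a hypothetical helper.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches per epoch for a map-style dataset with a fixed batch size."""
    if drop_last:
        # Incomplete trailing batch is discarded.
        return num_samples // batch_size
    # Incomplete trailing batch is kept.
    return math.ceil(num_samples / batch_size)


batch_size = 16
print(num_batches(87, batch_size, drop_last=True))   # 5, matches "Train epoch 0: 5/5"
print(num_batches(87, batch_size, drop_last=False))  # 6
print(num_batches(4, batch_size, drop_last=True))    # 0, matches "Validating: 0/0"
print(num_batches(4, batch_size, drop_last=False))   # 1
```

If that assumption holds, a validation set smaller than the batch size would yield no batches, so no metric would ever be updated. Trying a validation batch size of 4 or less (or `drop_last: False`, assuming the parameter is forwarded to the underlying DataLoader) might be worth checking.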

Versions

No response

BloodAxe commented 10 months ago

It could be that metric_to_watch should be "mAP@0.5" (without the trailing zero).
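For illustration, here is why a trailing zero matters: dictionary keys are compared as exact strings, so the two spellings are different keys. This is a minimal sketch in plain Python; which spelling the installed super-gradients version reports is not established in this thread, and the `reported` dictionary is hypothetical.

```python
iou_threshold = 0.5

two_decimals = f"mAP@{iou_threshold:.2f}"  # 'mAP@0.50'
shortest = f"mAP@{iou_threshold}"          # 'mAP@0.5'

# Hypothetical results dict that happens to use the two-decimal spelling.
reported = {two_decimals: 0.0}

print(two_decimals in reported, shortest in reported)  # True False
```

Printing the keys of the validation results, rather than guessing the spelling, would settle which form the metric is registered under.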

Louis-Dupont commented 9 months ago

@AmuzeLu did you try this? Does it fix your error?