Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Add more labels to a custom trained model #1746

Open rvryan67 opened 8 months ago

rvryan67 commented 8 months ago

💡 Your Question

I have a custom-trained model. It was trained using 13,000 images and their labels, and training took 12 hours.

I want to add more training data (images and labels).

Is there a way to incrementally add a small amount of additional training data and re-run training without it taking 12 hours to complete?

Versions

No response

BloodAxe commented 8 months ago

You can always load the weights of the model trained in the previous step and continue training from that state.

model = models.get(...., checkpoint_path=<ABSOLUTE_PATH_TO_CHECKPOINT_FROM_PREVIOUS_TRAINING>)

Or via the command line, if you are using YAML recipes:

python -m super_gradients.train_from_recipe --config-name=YOUR_RECIPES checkpoint_params.checkpoint_path=<ABSOLUTE_PATH_TO_CHECKPOINT_FROM_PREVIOUS_TRAINING>

rvryan67 commented 8 months ago

Hi @BloodAxe

Thanks for the suggestion. Is this what you mean?

model = models.get("yolo_nas_l", num_classes=2, checkpoint_path=r"custommodel/ckpt_best.pth").cuda()

trainer.train(model=model, training_params=train_params, train_loader=train_data, valid_loader=val_data)

BloodAxe commented 8 months ago

Exactly
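The suggestion above asks for an absolute checkpoint path, while the snippet passes a relative one. A minimal sketch of a helper that resolves the path and fails early if the file is missing follows. The name `resolve_checkpoint` is hypothetical and not part of super-gradients; only the `models.get(..., checkpoint_path=...)` call shown in the comment comes from this thread.

```python
from pathlib import Path


def resolve_checkpoint(path_str: str) -> str:
    """Return an absolute checkpoint path, raising early if the file is missing.

    Hypothetical helper, not a super-gradients function.
    """
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {path}")
    return str(path)


# Assumed usage with the call from the thread (requires super_gradients):
# model = models.get(
#     "yolo_nas_l",
#     num_classes=2,
#     checkpoint_path=resolve_checkpoint("custommodel/ckpt_best.pth"),
# ).cuda()
```

Resolving the path before calling `models.get` makes a wrong working directory show up as a clear `FileNotFoundError` rather than as a later loading problem.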

rvryan67 commented 8 months ago

The re-trained custom model is not giving me the results I expect.

The original custom model predicts correctly, i.e., it identifies an object with 0.9 confidence.

However, when I run the same prediction on the re-trained custom model, I don't get any prediction.

I wonder whether I am missing something in my training_params:

train_params = {
    # ENABLING SILENT MODE
    'silent_mode': False,
    "average_best_models": True,
    "warmup_mode": "linear_epoch_step",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "cosine",
    "cosine_final_lr_ratio": 0.1,
    "optimizer": "Adam",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    "ema": True,
    "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    "max_epochs": EPOCHS,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        # NOTE: num_classes needs to be defined here
        num_classes=len(dataset_params['classes']),
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=300,
            # NOTE: num_classes needs to be defined here
            num_cls=len(dataset_params['classes']),
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50'
}
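When continuing from a checkpoint, a common general practice is to run fewer epochs at a lower learning rate than in the original run. The sketch below derives such a variant from the keys already present in `train_params`. The helper name `make_finetune_params` and the values (a 0.1 learning-rate scale, at most one warmup epoch) are illustrative assumptions, not maintainer recommendations, and this is not a diagnosis of the missing predictions.

```python
def make_finetune_params(base_params: dict, epochs: int, lr_scale: float = 0.1) -> dict:
    """Return a copy of the training params adjusted for a short fine-tuning run.

    Hypothetical helper; the scaling values are assumptions for illustration.
    """
    # Shallow copy: loss and metric objects are shared, the dict itself is not.
    params = dict(base_params)
    params["max_epochs"] = epochs
    # Start from a smaller learning rate than the original run.
    params["initial_lr"] = base_params["initial_lr"] * lr_scale
    # Keep warmup short so it does not dominate a short run.
    params["lr_warmup_epochs"] = min(base_params.get("lr_warmup_epochs", 0), 1)
    return params


# Assumed usage: finetune_params = make_finetune_params(train_params, epochs=10)
```

The original dictionary is left unchanged, so both runs can be compared with the same loss and metric settings.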

BloodAxe commented 8 months ago

The provided snippet is not enough to help. Please show the rest of the code, including the data loader preparation (before and after you add more data), and the TensorBoard plots for both the regular training and the training with additional data.

rvryan67 commented 8 months ago

I'm using an AWS Sagemaker Training job to train the model.

Here is the code I use to create the Training Job

from sagemaker.estimator import Estimator
from sagemaker.pytorch import PyTorch
from sagemaker.session import TrainingInput

train_input = TrainingInput(dataset_s3_uri)

estimator = PyTorch(
    entry_point="train.py",
    role=role,
    source_dir="./yolo-nas-model-scripts",
    instance_count=1,
    instance_type='ml.g4dn.12xlarge',
    framework_version="1.13.1",
    py_version="py39",
    sagemaker_session=sagemaker_session,
    input_mode='File',  # FastFile causes an issue with writing the label cache
    output_path=dataset_s3_uri+'/output',
)

estimator.fit(train_input, job_name=job_name)

train.py is attached, renamed to train.txt.

rvryan67 commented 8 months ago

@BloodAxe

To restate my original question: "Is there a way to incrementally add a small amount of additional training data and re-run training without it taking 12 hours to complete?"

I have seen the following statement in this discussion: https://github.com/ultralytics/ultralytics/issues/4554#issuecomment-1695218721

"Currently, YOLOv8 does not have a feature for incremental learning"

Is the same true of YOLO-NAS?

BloodAxe commented 8 months ago

So what you are looking for is continual learning: a technique that allows training a model on a few new data samples without forgetting its existing knowledge.

Unfortunately, we don't support this at the moment.
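For context, one generic way to reduce forgetting when fine-tuning on new data is rehearsal (replay): the new samples are mixed with a random subset of the original training samples. The sketch below only illustrates that idea on lists of sample identifiers. It is not a super-gradients feature, the name `build_rehearsal_set` and the 20% replay fraction are assumptions, and it does not guarantee that existing knowledge is preserved.

```python
import random


def build_rehearsal_set(old_samples, new_samples, old_fraction=0.2, seed=0):
    """Mix all new samples with a random fraction of the old ones.

    Hypothetical helper illustrating rehearsal; the fraction is an assumption.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    old_samples = list(old_samples)
    n_old = min(len(old_samples), round(len(old_samples) * old_fraction))
    replay = rng.sample(old_samples, n_old)  # subset of the original data
    mixed = list(new_samples) + replay
    rng.shuffle(mixed)
    return mixed


# Assumed usage: the mixed list would feed whatever dataset class builds
# the training loader, instead of the full original image list.
```

Training on the mixed subset is shorter than retraining on all of the original data, at the cost of seeing only part of it again.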