Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0
4.54k stars 496 forks

Incorrect Checkpoint path #1604

Closed tztechno closed 11 months ago

tztechno commented 11 months ago

💡 Your Question

While training the model, a FileNotFoundError occurs even though the checkpoint path is correct. Is there any way to avoid this error?

FileNotFoundError: Incorrect Checkpoint path: /kaggle/working/checkpoints/xxxxx/average_model.pth (This should be an absolute path)

checkpoint_path = os.path.abspath(os.path.join(config.CHECKPOINT_DIR, 
                                               config.EXPERIMENT_NAME, 
                                               'average_model.pth'))

best_model = models.get(config.MODEL_NAME, 
                        num_classes=config.NUM_CLASSES, 
                        checkpoint_path=checkpoint_path)

Versions

No response

BloodAxe commented 11 months ago

Most likely the file is not present at the given location.
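A quick way to check this is to walk up from the expected path to the deepest directory that actually exists and list what it contains. The sketch below uses only the Python standard library; the helper names `deepest_existing_dir` and `report_missing_checkpoint` are hypothetical and are not part of super-gradients.

```python
from pathlib import Path


def deepest_existing_dir(path):
    """Return the deepest ancestor directory of `path` that exists on disk."""
    current = Path(path).absolute()
    for candidate in [current, *current.parents]:
        if candidate.is_dir():
            return candidate
    return None  # Not expected in practice: the filesystem root always exists.


def report_missing_checkpoint(path):
    """Return a message describing what is actually present near `path`."""
    path = Path(path)
    if path.is_file():
        return f"Found: {path}"
    parent = deepest_existing_dir(path)
    # Listing the directory makes an unexpected sub-folder easy to spot.
    entries = sorted(entry.name for entry in parent.iterdir())
    return (
        f"Missing: {path}\n"
        f"Deepest existing directory: {parent}\n"
        f"Contains: {entries}"
    )
```

Printing `report_missing_checkpoint(checkpoint_path)` before calling `models.get` should show whether the experiment folder holds the file directly or an extra sub-folder instead.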

Another possibility is that, if your xxxxx corresponds to a particular experiment, its path may contain a timestamp sub-folder. If you trained the model using a trainer, the right way to get the best checkpoint path is:

checkpoint_path=os.path.join(trainer.checkpoints_dir_path, "average_model.pth")
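If the trainer object is no longer available (for example, in a fresh notebook session), one fallback is to search the experiment folder for the checkpoint file and take the most recently modified match. This is only a sketch, assuming the layout is `<checkpoint dir>/<experiment name>/<timestamp sub-folder>/average_model.pth`; the helper `find_latest_checkpoint` is hypothetical and is not part of super-gradients.

```python
from pathlib import Path


def find_latest_checkpoint(checkpoint_root, experiment_name, filename="average_model.pth"):
    """Return the newest file named `filename` anywhere under the experiment folder."""
    experiment_dir = Path(checkpoint_root) / experiment_name
    if not experiment_dir.is_dir():
        raise FileNotFoundError(f"Experiment folder not found: {experiment_dir}")
    # rglob searches all sub-folders, so a timestamp sub-folder is handled.
    matches = [p for p in experiment_dir.rglob(filename) if p.is_file()]
    if not matches:
        raise FileNotFoundError(f"No {filename} found under {experiment_dir}")
    # Choose the most recently modified match and return an absolute path.
    return max(matches, key=lambda p: p.stat().st_mtime).absolute()
```

The returned path could then be passed as `checkpoint_path` to `models.get`, as in the question above. Selecting by modification time is a design choice of this sketch; if several runs exist, the newest one is not necessarily the one you want.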

tztechno commented 11 months ago

Thank you. As you pointed out, I found that a timestamp sub-folder had been generated. Previously there was no timestamp sub-folder. Since the timestamp is not a fixed value, using trainer.checkpoints_dir_path is essential, but it would have been difficult to find on my own. Now I can do custom training for YOLO-NAS with "mixed_precision": True set, even on the CPU. Thank you again.

pablopescador commented 9 months ago

Hi, I came across this thread by chance. I am following the documentation at "https://docs.deci.ai/super-gradients/latest/documentation/source/Example_Classification.html#5-training-checkpointing-and-transfer-learning" and had the same error. It would be worth correcting the following paragraph. Many thanks for your work. Have a good weekend. Greetings from Munich ;)

import os

model = models.get(model_name=Models.RESNET18, num_classes=10, checkpoint_path=os.path.join(CHECKPOINT_DIR, experiment_name, 'ckpt_latest.pth'))

training_params["resume"] = True
training_params["max_epochs"] = 25

trainer.train(model=model, training_params=training_params, train_loader=train_dataloader, valid_loader=valid_dataloader)