Describe the bug
It is not possible to modify the task (e.g. update and upload a checkpoint) in a callback registered with Task.register_abort_callback.
Trying to save a checkpoint in the callback produces the following warning, and the model is not updated:
2024-09-13 12:12:27,581 - clearml.model - WARNING - Could not update last created model in Task b281b21329e3470ebc8959e831f28ff8, Task status 'stopped' cannot be updated
To reproduce
Register a callback on the current task using something like this:
def on_abort_callback() -> None:
    print("Saving last checkpoint")
    trainer.save_checkpoint(
        self.last_filepath,
        weights_only=self.save_weights_only,
    )
    # Ensure that the trainer stops gracefully
    trainer.should_stop = True

print("Registering model checkpoint abort callback")
Task.current_task().register_abort_callback(on_abort_callback)
where trainer is a pytorch-lightning Trainer and the callback is registered from within an extended lightning ModelCheckpoint (docs), so self refers to the checkpoint instance and trainer is captured from the enclosing scope.
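For context, here is a minimal sketch of how that registration could be wired up. The class name AbortAwareModelCheckpoint, the use of the setup hook, and the last_filepath attribute are illustrative assumptions, not the reporter's actual code:

from clearml import Task
from pytorch_lightning.callbacks import ModelCheckpoint


class AbortAwareModelCheckpoint(ModelCheckpoint):
    # Hypothetical extension; last_filepath stands in for whatever path
    # attribute the real extended class maintains.
    last_filepath = "last.ckpt"

    def setup(self, trainer, pl_module, stage):
        super().setup(trainer, pl_module, stage)

        def on_abort_callback() -> None:
            # Persist the latest weights before ClearML tears the task down
            print("Saving last checkpoint")
            trainer.save_checkpoint(
                self.last_filepath,
                weights_only=self.save_weights_only,
            )
            # Ensure that the trainer stops gracefully
            trainer.should_stop = True

        print("Registering model checkpoint abort callback")
        Task.current_task().register_abort_callback(on_abort_callback)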
Expected behaviour
It should be possible to upload a model checkpoint to the ClearML server when a task is aborted in the abort callback function.
The current workaround is to mark the current task in_progress while saving the checkpoint and then mark it stopped again afterwards, as sketched below. Not intuitive :-)
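A minimal sketch of that workaround, assuming the task is flipped back with mark_started(force=True) before the save and returned to stopped with mark_stopped() afterwards (trainer, self.last_filepath and self.save_weights_only as in the snippet above):

def on_abort_callback() -> None:
    task = Task.current_task()
    # Temporarily move the task back to in_progress so the model update is accepted
    task.mark_started(force=True)
    trainer.save_checkpoint(
        self.last_filepath,
        weights_only=self.save_weights_only,
    )
    # Put the task back into the stopped state it was headed for
    task.mark_stopped()
    trainer.should_stop = True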
Environment
Related Discussion
https://clearml.slack.com/archives/CTK20V944/p1726571061754989