allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.71k stars 657 forks

Task is already marked stopped when the callback from Task.register_abort_callback is called #1330

Open mads-oestergaard opened 2 months ago

mads-oestergaard commented 2 months ago

Describe the bug

It is not possible to modify the task (e.g. update and upload a checkpoint) from a callback registered with Task.register_abort_callback, because the task is already marked stopped by the time the callback runs.

Trying to save a checkpoint in the callback gives the following warning:

2024-09-13 12:12:27,581 - clearml.model - WARNING - Could not update last created model in Task b281b21329e3470ebc8959e831f28ff8, Task status 'stopped' cannot be updated

To reproduce

Register a callback on the current task using something like this:

def on_abort_callback() -> None:
    print("Saving last checkpoint")
    trainer.save_checkpoint(
        self.last_filepath,
        weights_only=self.save_weights_only,
    )

    # Ensure that the trainer stops gracefully
    trainer.should_stop = True

print("Registering model checkpoint abort callback")
Task.current_task().register_abort_callback(on_abort_callback)

where trainer is a PyTorch Lightning Trainer and the callback is registered inside an extended Lightning ModelCheckpoint, which is where self comes from.

Expected behaviour

It should be possible to upload a model checkpoint to the ClearML server from within the abort callback when a task is aborted.

The current workaround is to mark the task in_progress while saving the checkpoint and then mark it stopped again afterwards. Not intuitive :-)
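For reference, here is a minimal sketch of that workaround as a context manager. It assumes the status flip is done with Task.mark_started(force=True) and Task.mark_stopped(force=True); the helper name temporarily_in_progress is made up for illustration, and the exact calls I use in my project may differ slightly.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_in_progress(task):
    """Flip an already-stopped task back to in_progress for the duration of the block.

    `task` is expected to be a clearml Task (e.g. Task.current_task()).
    Assumption: mark_started / mark_stopped with force=True are the calls
    used to change the status; this helper is hypothetical, not part of ClearML.
    """
    task.mark_started(force=True)  # stopped -> in_progress, so updates are accepted
    try:
        yield task
    finally:
        # Always restore the stopped status, even if saving the checkpoint fails
        task.mark_stopped(force=True)


# Usage inside the abort callback (sketch, needs clearml and a running task):
#
#     def on_abort_callback() -> None:
#         with temporarily_in_progress(Task.current_task()):
#             trainer.save_checkpoint(self.last_filepath, weights_only=self.save_weights_only)
#         trainer.should_stop = True
```

The try/finally is there so the task does not stay in_progress if the checkpoint save raises.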

Environment