allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

huggingface trainer hook calls task.close() prematurely #1116

Open nkgrush opened 1 year ago

nkgrush commented 1 year ago

Describe the bug

The Hugging Face Trainer class is integrated with ClearML. When trainer.train() finishes successfully, the trainer's ClearML hook calls task.close(), which makes the original ClearML task unavailable. I am referring to this line specifically (permalink).

To reproduce

from clearml import Task

task = Task.init(
    project_name='project',
    task_name='task',
)
...
model = ...
dataset = ...
...
from trl import SFTTrainer  # SFTTrainer subclasses transformers.Trainer
trainer_args = ...
trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=trainer_args,
)

print(task.status) # Running
trainer.train()
print(task.status) # Completed

# the task object is now unusable for most purposes

Expected behaviour

The main task should not be closed (making it unavailable) after the training is finished. This is especially important if there are multiple trainer runs or any custom actions are taken after training.
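
To make the expected behaviour concrete, here is a minimal sketch in plain Python. It does not use the real clearml or transformers APIs: StubTask, ClosingHook, OwnershipAwareHook and run_training are hypothetical stand-ins, and the ownership check is only an assumption about one way a hook could behave, not the actual integration code. The idea is that a hook should close a task only if it created that task itself.

```python
class StubTask:
    """Hypothetical stand-in for a ClearML task; not the real clearml.Task."""

    def __init__(self):
        self.status = "Running"

    def close(self):
        self.status = "Completed"


class ClosingHook:
    """Mimics the reported behaviour: always closes the task when training ends."""

    def __init__(self, task):
        self.task = task

    def on_train_end(self):
        self.task.close()


class OwnershipAwareHook:
    """Closes the task only if the hook created it itself (assumed design)."""

    def __init__(self, task=None):
        # Remember whether the caller supplied the task.
        self.owns_task = task is None
        self.task = StubTask() if task is None else task

    def on_train_end(self):
        if self.owns_task:
            self.task.close()


def run_training(hook):
    """Placeholder for trainer.train(): only fires the end-of-training hook."""
    hook.on_train_end()
    return hook.task.status


# Reported behaviour: the user's task is closed by the hook.
user_task = StubTask()
status_after_closing_hook = run_training(ClosingHook(user_task))

# Expected behaviour: a user-supplied task stays open after training.
user_task = StubTask()
status_after_aware_hook = run_training(OwnershipAwareHook(user_task))

# A task the hook created itself is still cleaned up.
status_of_hook_owned_task = run_training(OwnershipAwareHook())
```

With a hook like the second one, a task created by the user before training would still be usable for further trainer runs or custom post-training steps.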

Environment

Independent

eugen-ajechiloae-clearml commented 11 months ago

Hi @nkgrush! We have submitted a PR to Hugging Face related to this issue: https://github.com/huggingface/transformers/pull/26614