allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.65k stars 651 forks

Task is set to completed automatically when Huggingface Trainer is executed #967

Open meanna opened 1 year ago

meanna commented 1 year ago

Describe the bug

My code uses the Hugging Face Trainer (from the transformers library). After training finishes, I want to upload a model artifact, but the upload fails with this error:

Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=208a7835726347c59c1666302f0b9a81, artifacts=[{'key': 'model', 'type': 'string', 'uri': 'http://clearml.gpu.fra.ics.inovex.io:8081/KG_QA/fine-tune%20roberta.208a7835726347c59c1666302f0b9a81/artifacts/model/model.txt', 'content_size': 10, 'hash': 'b0d6dcfed49bb9415ec067e9d8969219c62176d9ce44da5a1fe672634112792d', 'timestamp': 1680620905, 'type_data': {'preview': 'merges.txt', 'content_type': 'text/plain'}}], force=True)

It seems the upload fails because the task status is already completed. It also looks like the transformers library is connected to another ClearML task, see: https://github.com/huggingface/transformers/blob/main/src/transformers/integrations.py

To reproduce

I tried to add ClearML to this example script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py

meanna commented 1 year ago

OK, fixed using task.mark_started()

It won't fix everything, though. With this bug I also cannot log tables and other things, and the console output in ClearML stops showing progress after the task is closed, which is bad.
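For anyone hitting the same error, the workaround can be sketched roughly as below. This is only a minimal sketch, not an official fix: the helper name `upload_artifact_after_training` is made up for illustration, and it assumes `task` is the `clearml.Task` returned by `Task.init(...)` and that `get_status()` reports the string `"completed"` once the Trainer has finished.

```python
def upload_artifact_after_training(task, name, artifact_object):
    """Upload an artifact, reopening the task first if it is already completed.

    `task` is assumed to behave like a clearml.Task (hypothetical helper,
    not part of the ClearML or transformers API).
    """
    if task.get_status() == "completed":
        # Put the task back into a running state so the server accepts
        # add_or_update_artifacts again.
        task.mark_started(force=True)
    return task.upload_artifact(name, artifact_object=artifact_object)
```

In a training script this would be called after `trainer.train()`, for example `upload_artifact_after_training(task, "model", "merges.txt")`. As noted above, it only addresses the artifact upload, not the console output stopping.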

thepycoder commented 1 year ago

Hi @meanna !

Originally, the task was closed automatically so you wouldn't overwrite anything when running training twice from a notebook. In a notebook environment, the task can't know when to properly close unless it is closed manually.

That said, I think it makes little sense to have this in there in hindsight. If a notebook user wants to rerun training, they should manually close the task themselves.
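To illustrate what closing the task manually could look like in a notebook, here is a minimal sketch. The `closing_task` helper is hypothetical (it is not part of ClearML), and it only assumes the object passed in has a `close()` method, as `clearml.Task` does.

```python
from contextlib import contextmanager


@contextmanager
def closing_task(task):
    """Yield the task and close it on exit, even if training raises.

    Hypothetical helper: `task` is assumed to expose close(), like clearml.Task.
    """
    try:
        yield task
    finally:
        task.close()
```

A notebook cell could then wrap each run as `with closing_task(Task.init(...)) as task: trainer.train()`, so that a rerun starts from a freshly initialized task. The standard library's `contextlib.closing` does the same job if you prefer not to define a helper.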

So there are 2 options to fix this:

I think it makes sense to just go with option 2: the notebook use case is suboptimal either way, and this way it won't get in users' way. What do you think?