Task is set to completed automatically when Huggingface Trainer is executed

meanna commented 1 year ago

Describe the bug

My code uses the Hugging Face Trainer (from the transformers library). After training finishes I want to upload a model artifact, but the upload fails with this error:

Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=208a7835726347c59c1666302f0b9a81, artifacts=[{'key': 'model', 'type': 'string', 'uri': 'http://clearml.gpu.fra.ics.inovex.io:8081/KG_QA/fine-tune%20roberta.208a7835726347c59c1666302f0b9a81/artifacts/model/model.txt', 'content_size': 10, 'hash': 'b0d6dcfed49bb9415ec067e9d8969219c62176d9ce44da5a1fe672634112792d', 'timestamp': 1680620905, 'type_data': {'preview': 'merges.txt', 'content_type': 'text/plain'}}], force=True)

It seems to fail because the task status is already completed. It also looks like the transformers library is connected to a separate ClearML task, see: https://github.com/huggingface/transformers/blob/main/src/transformers/integrations.py

To reproduce

I tried to add ClearML to this example script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py

meanna commented 1 year ago

OK, fixed using task.mark_started()

It won't fix everything, though. With this bug I also cannot log tables and other things, and the console output in ClearML stops showing progress once the task is closed, which is bad.
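
For anyone landing here, the workaround above can be sketched roughly as follows. This is a minimal sketch, not an official fix: the helper name `upload_after_trainer` is hypothetical, and passing `force=True` to `mark_started()` is an assumption on my part (the comment above only mentions calling `task.mark_started()`). The helper takes the ClearML task object as an argument, so it does not import clearml itself.

```python
def upload_after_trainer(task, name, artifact_object):
    """Reopen a task that was closed after training, then upload an artifact.

    `task` is expected to be a clearml Task (e.g. the object returned by
    Task.init or Task.current_task). Hypothetical helper, not part of ClearML.
    """
    # The Trainer integration may already have marked the task as completed,
    # in which case the server rejects artifact updates.
    if task.get_status() != "in_progress":
        # Assumption: force=True is used so the status change is applied
        # regardless of the current task state.
        task.mark_started(force=True)
    # Upload only once the task is running again.
    return task.upload_artifact(name=name, artifact_object=artifact_object)
```

Typical use would be to call `upload_after_trainer(task, "model", "merges.txt")` right after `trainer.train()` returns, where `task` is the task created in your own script. As noted above, this only covers the artifact upload; it does not bring back console output that was cut off while the task was closed.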

thepycoder commented 1 year ago

Hi @meanna !

Originally, this was done so you wouldn't override anything when running training twice from a notebook. In a notebook environment, the task can't know when to properly close, unless it is done manually.

That said, I think it makes little sense to have this in there in hindsight. If a notebook user wants to rerun training, they should manually close the task themselves.

So there are 2 options to fix this:

1. Add an environment variable to enable or disable auto-closing
2. Get rid of the auto-close altogether

I think it makes sense to just go with option 2: the notebook use case is suboptimal either way, and this way it won't get in users' way. What do you think?

allegroai / clearml

Task is set to completed automatically when Huggingface Trainer is executed #967
