allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

After model finishes training, no more scalars, etc. can be reported #1119

Open johnml1135 opened 1 year ago

johnml1135 commented 1 year ago

Describe the bug

self.clearml_task.get_logger().report_single_value(name="prevalue", value=1)
with create_model_trainer(parallel_corpus) as model_trainer:
    model_trainer.train(check_canceled=self.check_canceled)
    model_trainer.save()
self.clearml_task.get_logger().report_single_value(name="postvalue", value=2)

The prevalue is registered, but the postvalue is not.

eugen-ajechiloae-clearml commented 1 year ago

Hi @johnml1135! I think the logger is not flushed properly on program exit. Can you try calling self.clearml_task.get_logger().flush(wait=True) at the end of the script until we fix this?
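For reference, the suggested workaround could be wrapped up roughly as follows. This is only a sketch: `report_and_flush` is a hypothetical helper name, and `task` is assumed to be a ClearML task object exposing `get_logger()`. It only helps if the task is still open when the value is reported.

```python
def report_and_flush(task, name, value):
    """Report a single value on `task`, then block until the logger is flushed.

    Hypothetical helper: `task` is assumed to be a ClearML Task (or any object
    whose get_logger() returns a logger with report_single_value and flush).
    """
    logger = task.get_logger()
    logger.report_single_value(name=name, value=value)
    # wait=True asks the logger to block until pending reports have been sent
    return logger.flush(wait=True)
```

In the snippet from the bug report, this would replace the final call, e.g. `report_and_flush(self.clearml_task, "postvalue", 2)`.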

johnml1135 commented 1 year ago

We ended up finding another way around it - thank you for figuring out what was going on - we may need it again in the future.

johnml1135 commented 11 months ago

That fix does not appear to work.

The first main issue (the task is closed when training ends) is here in Hugging Face Transformers: https://github.com/huggingface/transformers/blob/0ebee8b93358b6ef0182398b8fcbd7afd64c0f97/src/transformers/integrations/integration_utils.py#L1488-L1493
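To illustrate the idea (this is not the code from the pull request or from the linked workaround), one way to stop a callback from closing the task at the end of training is to override its `on_train_end` hook. The mixin below is a hypothetical sketch; it assumes the only thing the parent's `on_train_end` does is close the task, and that the caller closes the task later.

```python
class KeepTaskOpenMixin:
    """Hypothetical mixin that skips the parent callback's end-of-training hook."""

    def on_train_end(self, args, state, control, **kwargs):
        # Deliberately do NOT call super().on_train_end(...), so the parent's
        # task-closing logic never runs. Closing the task is then up to the caller.
        return None


# Assumed usage (not executed here; requires transformers and clearml):
#   from transformers.integrations import ClearMLCallback
#   class OpenClearMLCallback(KeepTaskOpenMixin, ClearMLCallback):
#       pass
```

Listing the mixin first in the base classes matters: Python's method resolution order then picks the mixin's `on_train_end` over the parent's.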

I made a pull request to resolve the issue in Hugging Face: https://github.com/huggingface/transformers/pull/26763

For a temporary fix, you can use the solution from here: https://github.com/sillsdev/serval/issues/44#issuecomment-1758761393

The other aspect of this issue is that when reopening a task, you can actually write scalars, etc. to it.