allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.61k stars 651 forks

Clearml offset in scalars for pipeline step with retry_on_failure #1054

Closed MaximeChurin closed 1 year ago

MaximeChurin commented 1 year ago

Describe the bug

With continue_last_task=0, I can abort and enqueue a task, and my scalars are correctly reported after the previous ones. I tried to get the same behavior with a task created by a pipeline step that uses retry_on_failure, but in this setup my scalars have an offset.
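For reference, the single-task setup mentioned here might look like the minimal sketch below. The helper name `init_resumable_task` and its arguments are hypothetical; only `Task.init` and its `continue_last_task` parameter come from ClearML, and running the call needs ClearML installed and configured.

```python
def init_resumable_task(project_name: str, task_name: str):
    """Create or continue a ClearML task with the iteration offset set to 0.

    Hypothetical helper for illustration; not taken from the original report.
    """
    # Imported lazily so this module can be loaded without ClearML installed.
    from clearml import Task

    # continue_last_task=0 continues the previous run of the task and sets
    # the initial iteration offset to 0, so resumed scalars keep the step
    # numbers that the training framework reports.
    return Task.init(
        project_name=project_name,
        task_name=task_name,
        continue_last_task=0,
    )
```

Calling `init_resumable_task("my-project", "my-task")` at the start of the training script would be the intended usage; both names are placeholders.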

[Image: Offset in scalars]

Scalars should be reported each epoch in the range [1:100], but when a restart happens, the last iteration is added as an offset, which is on the order of 100k in my example.
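The effect can be illustrated with a small, framework-free sketch of the arithmetic. It models only how an added offset shifts the x-axis. It is not ClearML's implementation, and the function name and the numbers (a failure after epoch 40) are made up for illustration.

```python
def reported_iterations(steps, offset=0):
    """Return the x-axis values a logger would store for the given steps."""
    return [offset + step for step in steps]


# First run: epochs 1..40 are reported, then the step fails (illustrative).
first_run = reported_iterations(range(1, 41))
last_iteration = first_run[-1]  # 40

# The retry resumes at epoch 41 (initial_epoch=40 in Keras terms).
resumed_steps = range(41, 101)

# If the last iteration is added as an offset, the curve jumps forward.
with_offset = reported_iterations(resumed_steps, offset=last_iteration)

# With an offset of 0, the resumed epochs line up with the earlier ones.
without_offset = reported_iterations(resumed_steps, offset=0)

print(with_offset[0], without_offset[0])  # 81 41
```

In this toy case the first resumed point lands at 81 instead of 41. The larger jump in the report corresponds to a much larger stored last iteration.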

[Image: Logging of retry_on_failure on a pipeline]

To reproduce

Create a pipeline with TensorFlow and a TensorBoard callback in which the training step fails every x epochs (use a custom callback to trigger the failure), while using model.fit with initial_epoch. I can provide a snippet if needed, so let me know.
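The control flow of such a reproduction can be sketched without TensorFlow or ClearML. The code below is a stand-in, not the reporter's snippet: `train`, `run_with_retries`, and `SimulatedFailure` are hypothetical names, the list `log` replaces the TensorBoard scalar reports, and the retry loop only imitates what a pipeline step with retry_on_failure would do.

```python
class SimulatedFailure(RuntimeError):
    """Raised by the stand-in training loop to mimic a crashing step."""


def train(initial_epoch, total_epochs, fail_every, log):
    """Run epochs initial_epoch+1..total_epochs, failing every `fail_every` epochs."""
    for epoch in range(initial_epoch + 1, total_epochs + 1):
        log.append(epoch)  # stand-in for a per-epoch scalar report
        if epoch % fail_every == 0 and epoch < total_epochs:
            raise SimulatedFailure(f"failed after epoch {epoch}")


def run_with_retries(total_epochs=100, fail_every=30, max_retries=5):
    """Retry the training loop, resuming from the last logged epoch."""
    log = []
    for _ in range(max_retries + 1):
        # Resume where the previous attempt stopped, like initial_epoch in model.fit.
        initial_epoch = log[-1] if log else 0
        try:
            train(initial_epoch, total_epochs, fail_every, log)
            return log
        except SimulatedFailure:
            continue
    raise RuntimeError("retries exhausted")


epochs = run_with_retries()
print(epochs[0], epochs[-1], len(epochs))  # 1 100 100
```

The expected result is that every epoch from 1 to 100 is reported exactly once, which is the behaviour the scalars plot should show when no offset is added on retry.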

Expected behaviour

I would like ClearML to add no offset based on the value of last_iteration and to let TensorBoard manage the step numbering.

Environment

Related Discussion

We had already discussed the previous bug for a single task in Slack, and I reopened the thread to discuss the pipeline one.

ainoam commented 1 year ago

Thanks for letting us know @MaximeChurin.

We hope to have a fix out in an upcoming release.

MaximeChurin commented 1 year ago

@alex-burlacu-clear-ml when do you plan to make a release, even an rc one? After testing from main, I confirm it works fine with your fix, thanks.