ClearML offset in scalars for pipeline step with retry_on_failure

MaximeChurin commented 1 year ago

Describe the bug

Thanks to continue_last_task=0, I can abort and enqueue a task, and its scalars are correctly reported after the previous ones. I tried to get the same behavior with a task created by a pipeline that uses retry_on_failure, but in that setup my scalars have an offset.

[Image: Offset in scalars]

Scalars should be reported once per epoch in the range [1:100], but when a restart happens, ClearML adds the last iteration (on the order of 100k in my example) as an offset.
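To make the symptom concrete, here is a minimal sketch of the arithmetic only. It does not call ClearML; `reported_iteration`, the failure point, and the `last_iteration` value are hypothetical and chosen purely for illustration.

```python
def reported_iteration(step, initial_offset=0):
    """Iteration under which a scalar ends up stored (illustrative model only)."""
    return initial_offset + step


# Hypothetical values: the step failed after epoch 40, and the last iteration
# recorded for the task at that point was 100_000.
last_iteration = 100_000
resumed_epochs = range(41, 44)

# Expected: the resumed run keeps reporting at the epoch number itself.
expected = [reported_iteration(e) for e in resumed_epochs]

# Symptom described above: the last iteration is added on top of each epoch.
observed = [reported_iteration(e, initial_offset=last_iteration) for e in resumed_epochs]

print(expected)  # [41, 42, 43]
print(observed)  # [100041, 100042, 100043]
```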

[Image: Logging of retry_on_failure on a pipeline]

To reproduce

Create a pipeline with a TensorFlow training step that uses the TensorBoard callback and fails every x epochs (use a custom callback to trigger the failure), while calling model.fit with initial_epoch. I can provide a snippet if needed, so let me know.
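The control flow of that reproduction can be sketched without any framework. This is not the snippet offered above, only a stand-in under stated assumptions: `train` plays the role of model.fit with initial_epoch, `SimulatedFailure` plays the role of the custom failing callback, and `run_with_retries` plays the role of the pipeline retrying the step. All three names are hypothetical.

```python
class SimulatedFailure(RuntimeError):
    """Stands in for the exception raised by the custom failing callback."""


def train(initial_epoch, total_epochs, fail_every, report):
    """Stand-in for model.fit(..., initial_epoch=initial_epoch)."""
    for epoch in range(initial_epoch + 1, total_epochs + 1):
        report(epoch)  # one scalar per epoch, as the TensorBoard callback would log
        if epoch % fail_every == 0 and epoch != total_epochs:
            raise SimulatedFailure(f"failing after epoch {epoch}")


def run_with_retries(total_epochs=10, fail_every=4, max_retries=5):
    """Stand-in for a pipeline step that is retried after each failure."""
    reported = []
    for attempt in range(1, max_retries + 2):
        # Resume from the last epoch that was reported before the failure.
        initial_epoch = reported[-1] if reported else 0
        try:
            train(initial_epoch, total_epochs, fail_every, reported.append)
            return reported, attempt
        except SimulatedFailure:
            continue
    raise RuntimeError("retries exhausted")


reported, attempts = run_with_retries()
print(attempts)  # 3
print(reported)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

With no offset applied, the reported epochs form one continuous sequence across the retries, which is the behavior requested in this issue.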

Expected behaviour

I would like ClearML to add no offset based on the value of last_iteration and to let TensorBoard manage the iteration numbers.

Environment

Server type: self-hosted
ClearML SDK Version: clearml==1.11.0
ClearML Server Version (Only for self hosted): 1.9.2-317
Python Version: 3.8.16
OS (Windows \ Linux \ Macos): Linux

Related Discussion

We already discussed the earlier bug, for a single task, on Slack, and I have reopened that thread to discuss the pipeline case.

ainoam commented 1 year ago

Thanks for letting us know @MaximeChurin.

Hope to have a fix out in a near release.

MaximeChurin commented 1 year ago

@alex-burlacu-clear-ml when do you plan to make a release, even an rc one? After testing from main, I confirm it works fine with your fix, thanks.
