allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.69k stars 655 forks

Restarting training and reusing last task creates gaps in iteration axis in scalars section #762

Open antonlukyanov opened 2 years ago

antonlukyanov commented 2 years ago

Describe the bug

I'm training a model with the tf.estimator API. I then abort training and restart it while reusing the last task. The only code I added is:

task = Task.init(project_name='OCR/CRNN',
                 task_type='training',
                 task_name='CRNN from scratch',
                 reuse_last_task_id=True,
                 continue_last_task=True)

After restarting training, huge gaps appear on the iteration axis (see the screenshot).

[Screenshot: scalars plot with gaps on the iteration axis]

To reproduce

  1. Use the following sample script:

    import os
    import dataclasses as dc
    import numpy as np
    import tensorflow as tf
    import tensorflow.keras as tfk
    from clearml import Task

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    y_train = y_train.astype(np.int32)
    y_test = y_test.astype(np.int32)

    @dc.dataclass
    class Config:
        batch_size = 32
        learning_rate = 1e-4
        model_directory = '/path/to/mnist_estimator'

    #%%
    task = Task.init(project_name='tf.estimator/DNNClassifier-MNIST',
                     task_type='training',
                     task_name='DNNClassifier',
                     reuse_last_task_id=True,
                     continue_last_task=True)

    os.environ['CUDA_VISIBLE_DEVICES'] = ''
    config = Config()
    feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]

    classifier = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[256, 32],
        optimizer=tfk.optimizers.Adam(learning_rate=config.learning_rate),
        n_classes=10,
        dropout=0.1,
        config=tf.estimator.RunConfig(
            save_summary_steps=x_train.shape[0] / config.batch_size,
            save_checkpoints_secs=10,
            session_config=tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True)),
            log_step_count_steps=1000,
        ),
        model_dir=config.model_directory
    )

    train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
        x={"x": x_train},
        y=y_train,
        num_epochs=None,
        batch_size=config.batch_size,
        shuffle=True,
    )

    test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
        x={"x": x_test},
        y=y_test,
        num_epochs=1,
        shuffle=False
    )

    tf.estimator.train_and_evaluate(
        classifier,
        tf.estimator.TrainSpec(
            input_fn=train_input_fn
        ),
        tf.estimator.EvalSpec(
            input_fn=test_input_fn,
            steps=None,
            throttle_secs=5
        )
    )

  2. Train the model with tf.estimator API
  3. Abort training
  4. Continue Training
  5. Huge gaps appear in iteration axis.

Expected behaviour

Graphs don't contain huge gaps and iteration number (global step in this case) is correctly obtained.

Environment

  * Server type: self hosted
  * ClearML SDK Version: 1.6.4
  * ClearML Server Version: WebApp: 1.6.0-213 • Server: 1.6.0-213 • API: 2.20
  * Python Version: 3.9.12
  * OS: Linux

erezalg commented 2 years ago

Hi @antonlukyanov,

I tried to reproduce your scenario with a simple script but couldn't. I used this:

from clearml import Task, Logger
from time import sleep
import random

t = Task.init(project_name='tests', task_name='continue test', reuse_last_task_id=True, continue_last_task=True)
l = t.get_logger()

# Show the iteration offset and the last recorded iteration right after init
print('initial iteration {} last iteration {}'.format(t.get_initial_iteration(), t.get_last_iteration()))

for i in range(1, 1000000):
    print(i)
    l.report_scalar(title='my_title', series='my_series', value=i + random.randrange(0, 5), iteration=i)
    sleep(0.001)

print('initial iteration {} last iteration {}'.format(t.get_initial_iteration(), t.get_last_iteration()))

Can you also try to add this print after Task.init and see if iterations make sense when resuming?
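As a rough guide to what "make sense" could mean for those printed values, here is a minimal sketch of one possible mechanism. It assumes, without confirmation anywhere in this thread, that a continued task shifts every reported iteration by an initial offset equal to the previous run's last iteration, while tf.estimator separately restores its own global step from the checkpoint. The helpers `plotted_iterations` and `largest_gap` and all step values are hypothetical and use no ClearML or TensorFlow calls.

```python
def plotted_iterations(reported_steps, initial_offset):
    """Shift each reported step by the task's initial iteration offset."""
    return [step + initial_offset for step in reported_steps]


def largest_gap(iterations):
    """Return the largest jump between consecutive iteration values."""
    ordered = sorted(iterations)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0)


# First run (hypothetical numbers): a summary every 1000 steps, aborted at 3000.
first_run = plotted_iterations([1000, 2000, 3000], initial_offset=0)

# Resumed run: the framework restores global_step=3000 from its checkpoint and
# keeps counting from there. ASSUMPTION: the task also adds an offset of 3000.
resumed_steps = [4000, 5000, 6000]
double_counted = first_run + plotted_iterations(resumed_steps, initial_offset=3000)
consistent = first_run + plotted_iterations(resumed_steps, initial_offset=0)

print(double_counted)
print(largest_gap(double_counted), largest_gap(consistent))
```

If the printed initial iteration is non-zero on resume while the estimator's global step also continues from the checkpoint, the plot would show this kind of jump. `Task.set_initial_iteration` is the SDK call that controls the offset. Whether resetting it resolves this particular report is not established here.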

Lastly, I looked for example code for tf estimators and found only a linear regression one. Is there an easy example I can use to reproduce this?

antonlukyanov commented 2 years ago

Hi @erezalg, thanks for the reply. I noticed this behaviour specifically with estimators, whereas your code doesn't use them. Also, it happens when training is aborted and then resumed by running the same script again, not just put to sleep. Let me come up with sample code a bit later.
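The difference described above (a process that is killed and rerun, with the step counter restored from disk) can be mimicked without estimators. The following is only a sketch: `run_once`, the JSON state file, and the step counts are hypothetical stand-ins for a checkpoint, and the `report` callable would be replaced by a real scalar-reporting call in an actual reproduction.

```python
import json
import os
import tempfile


def run_once(state_path, n_steps, report):
    """Run n_steps steps, resuming the step counter saved in state_path."""
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = json.load(f)["global_step"]
    for step in range(start + 1, start + n_steps + 1):
        # In a real reproduction this would report a scalar at iteration=step.
        report(step)
    with open(state_path, "w") as f:
        json.dump({"global_step": start + n_steps}, f)
    return start + n_steps


reported = []
state_file = os.path.join(tempfile.mkdtemp(), "state.json")
run_once(state_file, 3, reported.append)  # first run, then "abort"
run_once(state_file, 3, reported.append)  # rerun resumes from the saved step
print(reported)
```

Here both runs happen in one process for brevity. Running the script twice from the shell, with the state file at a fixed path, is closer to the abort-and-restart scenario in the report.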

antonlukyanov commented 2 years ago

@erezalg Here's the script to train DNNClassifier on MNIST data which reproduces the bug. TensorFlow version is 2.9.

import os
import dataclasses as dc
import numpy as np
import tensorflow as tf
import tensorflow.keras as tfk
from clearml import Task

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

@dc.dataclass
class Config:
    batch_size = 32
    learning_rate = 1e-4
    model_directory = '/path/to/mnist_estimator'

#%%
task = Task.init(project_name='tf.estimator/DNNClassifier-MNIST',
                 task_type='training',
                 task_name='DNNClassifier',
                 reuse_last_task_id=True,
                 continue_last_task=True)

os.environ['CUDA_VISIBLE_DEVICES'] = ''
config = Config()
feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tfk.optimizers.Adam(learning_rate=config.learning_rate),
    n_classes=10,
    dropout=0.1,
    config=tf.estimator.RunConfig(
        save_summary_steps=x_train.shape[0] / config.batch_size,
        save_checkpoints_secs=10,
        session_config=tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True)),
        log_step_count_steps=1000,
    ),
    model_dir=config.model_directory
)

train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_train},
    y=y_train,
    num_epochs=None,
    batch_size=config.batch_size,
    shuffle=True,
)

test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_test},
    y=y_test,
    num_epochs=1,
    shuffle=False
)

tf.estimator.train_and_evaluate(
    classifier,
    tf.estimator.TrainSpec(
        input_fn=train_input_fn
    ),
    tf.estimator.EvalSpec(
        input_fn=test_input_fn,
        steps=None,
        throttle_secs=5
    )
)

image

erezalg commented 2 years ago

Hi @antonlukyanov,

Thanks for the code. We are now able to reproduce the issue and will let you know once it is resolved.