Open antonlukyanov opened 2 years ago
Hi @antonlukyanov,
I tried to reproduce your scenario with a simple script and couldn't. This is what I used:
from clearml import Task, Logger
from time import sleep
import random

t = Task.init(project_name='tests', task_name='continue test', reuse_last_task_id=True, continue_last_task=True)
l = t.get_logger()
# Print the iteration range right after Task.init
print('initial iteration {} last iteration {}'.format(t.get_initial_iteration(), t.get_last_iteration()))
for i in range(1, 1000000):
    print(i)
    l.report_scalar(title='my_title', series='my_series', value=i + random.randrange(0, 5), iteration=i)
    sleep(0.001)
# Same print after the loop; uses `t`, the task object created above
print('initial iteration {} last iteration {}'.format(t.get_initial_iteration(), t.get_last_iteration()))
Can you also try to add this print after Task.init and see if iterations make sense when resuming?
Lastly, I looked for example code for tf estimators and found only a linear regression one. Is there an easy example I can use to reproduce this?
Hi @erezalg, thanks for the reply. I noticed this behaviour with estimators, which your code doesn't use. Also, it happens when training is aborted and then resumed by running the same script again, not when the process is merely put to sleep. Let me come up with sample code a bit later.
@erezalg Here's the script to train DNNClassifier on MNIST data which reproduces the bug. TensorFlow version is 2.9.
import os
import dataclasses as dc
import numpy as np
import tensorflow as tf
import tensorflow.keras as tfk
from clearml import Task
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
@dc.dataclass
class Config:
    batch_size = 32
    learning_rate = 1e-4
    model_directory = '/path/to/mnist_estimator'
#%%
task = Task.init(project_name='tf.estimator/DNNClassifier-MNIST',
                 task_type='training',
                 task_name='DNNClassifier',
                 reuse_last_task_id=True,
                 continue_last_task=True)
os.environ['CUDA_VISIBLE_DEVICES'] = ''
config = Config()
feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tfk.optimizers.Adam(learning_rate=config.learning_rate),
    n_classes=10,
    dropout=0.1,
    config=tf.estimator.RunConfig(
        save_summary_steps=x_train.shape[0] / config.batch_size,
        save_checkpoints_secs=10,
        session_config=tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True)),
        log_step_count_steps=1000,
    ),
    model_dir=config.model_directory
)
train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_train},
    y=y_train,
    num_epochs=None,
    batch_size=config.batch_size,
    shuffle=True,
)
test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_test},
    y=y_test,
    num_epochs=1,
    shuffle=False
)
tf.estimator.train_and_evaluate(
    classifier,
    tf.estimator.TrainSpec(
        input_fn=train_input_fn
    ),
    tf.estimator.EvalSpec(
        input_fn=test_input_fn,
        steps=None,
        throttle_secs=5
    )
)
Hi @antonlukyanov,
Thanks for the code, we are now able to reproduce the issue. We'll let you know once it is resolved.
Describe the bug
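I'm training a model with the tf.estimator API. Then I abort training and restart it while reusing the last task. All the code that I added is the Task.init call shown under "To reproduce", with reuse_last_task_id=True and continue_last_task=True.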
After restarting training, huge gaps appear on the iteration axis (see the screenshot).
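One possible explanation, offered as a hypothesis and not as a confirmed diagnosis: when a task is continued, the logger may add an offset derived from the previous run's last iteration, while the estimator restores its own global step from the checkpoint and keeps counting from there. If both happen, resumed points would land at roughly twice the expected step, which would look like a gap. The sketch below only simulates that arithmetic in plain Python. The function names, the "last step + 1" offset, and the numbers are all hypothetical, and no ClearML or TensorFlow call is involved.

```python
from typing import List


def reported_iteration(global_step: int, initial_offset: int) -> int:
    """Iteration that would be plotted if an offset is added to the framework's step."""
    return initial_offset + global_step


def simulate_resume(last_step_before_abort: int,
                    steps_after_resume: int,
                    apply_offset: bool) -> List[int]:
    """Return the iterations plotted after resuming from a checkpoint.

    Assumption (hypothetical): the framework restores its step counter from
    the checkpoint, so the first step after resume is last_step_before_abort + 1,
    and the offset, when applied, is also last_step_before_abort + 1.
    """
    offset = last_step_before_abort + 1 if apply_offset else 0
    first = last_step_before_abort + 1
    return [reported_iteration(step, offset)
            for step in range(first, first + steps_after_resume)]


if __name__ == "__main__":
    # Hypothetical run aborted at step 1000, then resumed for 3 steps.
    print(simulate_resume(1000, 3, apply_offset=True))   # [2002, 2003, 2004]
    print(simulate_resume(1000, 3, apply_offset=False))  # [1001, 1002, 1003]
```

With the offset applied, the axis would jump from 1000 straight to 2002 in this toy model; without it, the steps continue contiguously. Whether this matches the actual cause here would need to be checked against the values printed by get_initial_iteration() and get_last_iteration() after resuming.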
To reproduce
import os
import dataclasses as dc
import numpy as np
import tensorflow as tf
import tensorflow.keras as tfk
from clearml import Task

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

@dc.dataclass
class Config:
    batch_size = 32
    learning_rate = 1e-4
    model_directory = '/path/to/mnist_estimator'

#%%
task = Task.init(project_name='tf.estimator/DNNClassifier-MNIST',
                 task_type='training',
                 task_name='DNNClassifier',
                 reuse_last_task_id=True,
                 continue_last_task=True)

os.environ['CUDA_VISIBLE_DEVICES'] = ''
config = Config()
feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tfk.optimizers.Adam(learning_rate=config.learning_rate),
    n_classes=10,
    dropout=0.1,
    config=tf.estimator.RunConfig(
        save_summary_steps=x_train.shape[0] / config.batch_size,
        save_checkpoints_secs=10,
        session_config=tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True)),
        log_step_count_steps=1000,
    ),
    model_dir=config.model_directory
)

train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_train},
    y=y_train,
    num_epochs=None,
    batch_size=config.batch_size,
    shuffle=True,
)

test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": x_test},
    y=y_test,
    num_epochs=1,
    shuffle=False
)

tf.estimator.train_and_evaluate(
    classifier,
    tf.estimator.TrainSpec(
        input_fn=train_input_fn
    ),
    tf.estimator.EvalSpec(
        input_fn=test_input_fn,
        steps=None,
        throttle_secs=5
    )
)