awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

tf.keras saves step at end of batch #144

Open jarednielsen opened 4 years ago

jarednielsen commented 4 years ago

Running the following script with tensorflow==1.15.0:

import tensorflow.compat.v2 as tf
import smdebug.tensorflow as smd
from tempfile import TemporaryDirectory

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255, x_test / 255

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])

with TemporaryDirectory() as dirpath:
    hook = smd.KerasHook(out_dir=dirpath)

    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5, callbacks=[hook])

    trial = smd.create_trial(path=dirpath)
    print(hook)
    print(trial)

gives the following output:

<smdebug.tensorflow.keras.KerasHook object at 0x1025aaed0>:(
    out_dir=/var/folders/r1/mgxfss8d45jbs_vl464bbsg906jznv/T/tmpdzybvlqg,
    tensorboard_dir=None,
    step=9374,
    mode=ModeKeys.TRAIN,
    mode_steps={<ModeKeys.GLOBAL: 4>: 9374, <ModeKeys.TRAIN: 1>: 9374},
    include_collections=['metrics', 'losses', 'sm_metrics'],
    writer=None,
    save_config=<class SaveConfig: {<ModeKeys.TRAIN: 1>: <class SaveConfig: save_interval=500, save_steps=[], start_step=0, end_step=None>, <ModeKeys.EVAL: 2>: <class SaveConfig: save_interval=500, save_steps=[], sta ...>,
    reduction_config=<class ReductionConfig: reductions=[], abs_reductions=[], norms=[], abs_norms=[]>,
    save_all=False,
    dry_run=False,
)
<smdebug.trials.local_trial.LocalTrial object at 0x1025b0f50>:(
    name=tmpdzybvlqg,
    path=/var/folders/r1/mgxfss8d45jbs_vl464bbsg906jznv/T/tmpdzybvlqg,
    steps=[0, 500, 1000, 1500, 1874, 2000, 2500, 3000, 3500, 3749, 4000, 4500, 5000, 5500, 5624, 6000, 6500, 7000, 7499, 7500, 8000, 8500, 9000, 9374],
    collections=['default', 'weights', 'biases', 'gradients', 'losses', 'metrics', 'inputs', 'outputs', 'all', 'sm_metrics'],
    tensor_names=['acc', 'batch', 'loss', 'size'],
)

It appears to be saving at step 1874 and then every 1875 steps after that (1874, 3749, 5624, 7499, 9374), in addition to every 500th step. Is this desired behavior?

Vikas-kum commented 4 years ago

Can you check the mode and mode step of the saved global steps?
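
One way to run that check is sketched below. It assumes the trial object exposes `steps()`, `mode(global_step)` and `mode_step(global_step)`, as described in the smdebug analysis docs; the helper name `describe_saved_steps` is hypothetical and not part of the library.

```python
def describe_saved_steps(trial):
    """Return (global_step, mode, mode_step) for every step saved in the trial.

    Assumes `trial` provides steps(), mode(global_step) and
    mode_step(global_step); the helper itself is illustrative only.
    """
    rows = []
    for global_step in trial.steps():
        # Look up which mode (TRAIN, EVAL, ...) produced this global step
        # and what the step number was within that mode.
        rows.append((global_step, trial.mode(global_step), trial.mode_step(global_step)))
    return rows
```

With the script above, calling `describe_saved_steps(trial)` inside the `with` block and printing each row would show whether the extra steps such as 1874 belong to TRAIN mode.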

rahul003 commented 4 years ago

This is probably the last step in an epoch. At that point we save additional metrics, which Keras only gives us at the end of an epoch.
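
The arithmetic supports this reading. A minimal sketch, assuming the MNIST training set of 60000 samples and the Keras default batch size of 32 (neither is set explicitly in the script), so each epoch has 1875 steps and ends at a zero-based step index of 1874, 3749, and so on. The function name `expected_saved_steps` is made up for illustration.

```python
import math


def expected_saved_steps(num_samples, batch_size, epochs, save_interval):
    """Steps saved if the hook writes every `save_interval` steps
    and also at the last step of each epoch (zero-based indices)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Regular saves: 0, save_interval, 2 * save_interval, ...
    interval_steps = set(range(0, total_steps, save_interval))
    # Extra saves: the final step of every epoch.
    epoch_end_steps = {steps_per_epoch * (e + 1) - 1 for e in range(epochs)}
    return sorted(interval_steps | epoch_end_steps)


# Assumed values: 60000 MNIST training samples, default batch size 32,
# epochs=5 and save_interval=500 as in the output above.
steps = expected_saved_steps(60000, 32, 5, 500)
print(steps)
```

Under these assumptions the result matches the `steps` list reported by the trial, which is consistent with the extra saves being end-of-epoch steps rather than a second save interval.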