aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0
10.1k stars · 6.76k forks

[Debugger] tensorflow_nlp_sentiment_analysis training job runs much longer than expected #1957

Open mchoi8739 opened 3 years ago

mchoi8739 commented 3 years ago

https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_nlp_sentiment_analysis

Since the notebook was published with the public SDK version, the training job runs for over an hour with epochs=25. It's expected to finish in ~500 seconds.

mchoi8739 commented 3 years ago

After debugging the output model parameters, I found that accuracy (and loss) is not improving at all. We need to investigate this convergence issue.

Debugger hook config added:

from sagemaker.debugger import DebuggerHookConfig

# Save tensors in the "train" collection every 10 steps
hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "10"
    }
)
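For completeness, the hook config is attached to the training job through the estimator's `debugger_hook_config` parameter. A sketch (entry point, role, and versions below are placeholders, not the notebook's actual settings):

```python
from sagemaker.tensorflow import TensorFlow

# Placeholder script/role/versions -- the point is the debugger_hook_config argument
estimator = TensorFlow(
    entry_point="train.py",            # hypothetical training script
    role="<your-sagemaker-role-arn>",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.3",
    py_version="py37",
    debugger_hook_config=hook_config,
)
```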

querying output tensors:

import smdebug
from smdebug.trials import create_trial

# Load the tensors that the Debugger hook saved for this training job
trial = create_trial(estimator.latest_job_debugger_artifacts_path())

# Step -> value mapping for the training accuracy
trial.tensor("accuracy").values(mode=smdebug.modes.TRAIN)

returns:

{0: array([0.59765625], dtype=float32),
 10: array([0.5269886], dtype=float32),
 20: array([0.51023066], dtype=float32),
 30: array([0.50491434], dtype=float32),
 40: array([0.5074314], dtype=float32),
 50: array([0.50375307], dtype=float32),
 60: array([0.50390625], dtype=float32),
 70: array([0.50099033], dtype=float32),
 80: array([0.501495], dtype=float32),
 90: array([0.49853516], dtype=float32),
 100: array([0.5061947], dtype=float32),
 110: array([0.51073444], dtype=float32),
 120: array([0.50228214], dtype=float32),
 130: array([0.50040984], dtype=float32),
 140: array([0.49925473], dtype=float32),
 150: array([0.49965358], dtype=float32),
 160: array([0.49894366], dtype=float32),
 170: array([0.496875], dtype=float32),
 180: array([0.5033854], dtype=float32),
 190: array([0.5046875], dtype=float32),
 200: array([0.49977458], dtype=float32),
 210: array([0.5003499], dtype=float32),
 220: array([0.4997856], dtype=float32),
 230: array([0.5025374], dtype=float32),
 240: array([0.50193596], dtype=float32),
 250: array([0.44140625], dtype=float32),
 260: array([0.5019531], dtype=float32),
 270: array([0.49698153], dtype=float32),
 280: array([0.4967041], dtype=float32),
 290: array([0.4971168], dtype=float32),
 300: array([0.5003025], dtype=float32),
 310: array([0.49987328], dtype=float32),
 320: array([0.49912778], dtype=float32),
 330: array([0.49990433], dtype=float32),
 340: array([0.50130206], dtype=float32),
 350: array([0.49917763], dtype=float32),
 360: array([0.50363684], dtype=float32),
 370: array([0.50130206], dtype=float32),
 380: array([0.50063777], dtype=float32),
 390: array([0.49867585], dtype=float32),
 400: array([0.497269], dtype=float32),
 410: array([0.49607667], dtype=float32)}
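For context, those values hover around 0.5, which is chance level for binary sentiment classification. A quick check (plain Python; a few values copied from the output above):

```python
# A few of the saved accuracy values from the output above
accuracies = [0.59765625, 0.5269886, 0.51023066, 0.50491434,
              0.50375307, 0.49853516, 0.5003025, 0.49607667]

# After the first couple of saves, accuracy never strays far from 0.5,
# i.e. the model is effectively guessing on a binary task
later = accuracies[2:]
assert all(abs(a - 0.5) < 0.02 for a in later)
print(f"mean accuracy over later steps: {sum(later) / len(later):.4f}")
```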
hongshanli23 commented 3 years ago

What's the link to the notebook?

mchoi8739 commented 3 years ago

Updated the link.

hongshanli23 commented 3 years ago

@mchoi8739 does the Debugger team have a GitHub repo? You could post this question on their Issues page.

hongshanli23 commented 3 years ago

@mchoi8739 , in

from sagemaker.debugger import DebuggerHookConfig
hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "10"
    }
)

do you know what key-value pairs hook_parameters takes?

mchoi8739 commented 3 years ago

Yes, I listed them here: https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig

hongshanli23 commented 3 years ago

Okay, then

  hook_parameters={
        "train.save_interval": "10"
    }

means the training program sends model parameters to Debugger every 10 steps. That is too frequent, which is why the training takes so long.
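To put a number on that (the last step is taken from the output above; the coarser interval is just an illustrative value):

```python
# The output above shows saves at steps 0, 10, ..., 410
last_step, interval = 410, 10
num_saves = last_step // interval + 1  # one save per 10 steps, plus step 0

# With a coarser interval, far fewer saves occur over the same run
coarse_interval = 100
coarse_saves = last_step // coarse_interval + 1

print(num_saves, coarse_saves)  # 42 vs 5 saves
```

Raising "train.save_interval" in the hook_parameters dict (e.g. to "100") would cut the save overhead roughly tenfold, though whether that alone explains the slowdown, and the separate non-convergence, still needs investigation.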