Open mchoi8739 opened 3 years ago
After inspecting the output model parameters with Debugger, accuracy (and loss) are not improving at all. Need to investigate this non-convergence issue.
Debugger hook config added:

from sagemaker.debugger import DebuggerHookConfig

hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "10"
    }
)
Querying the output tensors:

import smdebug
from smdebug.trials import create_trial

trial = create_trial(estimator.latest_job_debugger_artifacts_path())
trial.tensor("accuracy").values(smdebug.modes.TRAIN)
returns:
{0: array([0.59765625], dtype=float32),
10: array([0.5269886], dtype=float32),
20: array([0.51023066], dtype=float32),
30: array([0.50491434], dtype=float32),
40: array([0.5074314], dtype=float32),
50: array([0.50375307], dtype=float32),
60: array([0.50390625], dtype=float32),
70: array([0.50099033], dtype=float32),
80: array([0.501495], dtype=float32),
90: array([0.49853516], dtype=float32),
100: array([0.5061947], dtype=float32),
110: array([0.51073444], dtype=float32),
120: array([0.50228214], dtype=float32),
130: array([0.50040984], dtype=float32),
140: array([0.49925473], dtype=float32),
150: array([0.49965358], dtype=float32),
160: array([0.49894366], dtype=float32),
170: array([0.496875], dtype=float32),
180: array([0.5033854], dtype=float32),
190: array([0.5046875], dtype=float32),
200: array([0.49977458], dtype=float32),
210: array([0.5003499], dtype=float32),
220: array([0.4997856], dtype=float32),
230: array([0.5025374], dtype=float32),
240: array([0.50193596], dtype=float32),
250: array([0.44140625], dtype=float32),
260: array([0.5019531], dtype=float32),
270: array([0.49698153], dtype=float32),
280: array([0.4967041], dtype=float32),
290: array([0.4971168], dtype=float32),
300: array([0.5003025], dtype=float32),
310: array([0.49987328], dtype=float32),
320: array([0.49912778], dtype=float32),
330: array([0.49990433], dtype=float32),
340: array([0.50130206], dtype=float32),
350: array([0.49917763], dtype=float32),
360: array([0.50363684], dtype=float32),
370: array([0.50130206], dtype=float32),
380: array([0.50063777], dtype=float32),
390: array([0.49867585], dtype=float32),
400: array([0.497269], dtype=float32),
410: array([0.49607667], dtype=float32)}
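As a sanity check on the dump above, a small stdlib-only sketch (not from the original thread) that measures the net accuracy change across steps; the dict holds a hand-copied subset of the values printed by the trial query:

```python
# Subset of the step -> accuracy values returned by trial.tensor("accuracy")
accuracies = {
    0: 0.59765625,
    100: 0.5061947,
    200: 0.49977458,
    300: 0.5003025,
    410: 0.49607667,
}

steps = sorted(accuracies)
# Net change from the first to the last recorded step
improvement = accuracies[steps[-1]] - accuracies[steps[0]]
print(f"accuracy change over {steps[-1]} steps: {improvement:+.4f}")
# A near-zero or negative change confirms the model is hovering around
# 0.5 (chance level for binary sentiment) rather than converging.
```

Here the change is negative, so the run is doing no better than its first measurement.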
What's the link to the notebook?
Updated the link.
@mchoi8739 does the Debugger team have a GitHub repo? You could link this question in their Issues.
@mchoi8739, in

from sagemaker.debugger import DebuggerHookConfig

hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "10"
    }
)

do you know what key-value pairs hook_parameters takes?
Okay, then

hook_parameters={
    "train.save_interval": "10"
}

means the training program sends model parameters to Debugger every 10 steps. That's too frequent, which is why the training takes so long.
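If the save frequency is the bottleneck, one option would be to raise the interval, e.g. save only every 100 training steps. A sketch of that config (the interval value 100 is an assumption for illustration, not tested against this notebook):

```python
from sagemaker.debugger import DebuggerHookConfig

# Save training-mode tensors every 100 steps instead of every 10,
# cutting the number of tensor writes by 10x.
hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100"
    }
)
```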
https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_nlp_sentiment_analysis
Since the notebook was published with the public SDK version, the training job has been running for over an hour with epoch=25. It's expected to finish in ~500 seconds.