aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

How to enable tensorboard to real-time monitor model training performance? #208

Closed ThisIsRick closed 6 years ago

ThisIsRick commented 6 years ago

I'm running training on SageMaker with a Docker image that contains my own algorithm, following the example in [1]. How can I enable TensorBoard to monitor training performance in real time? My model is built with Keras on a TensorFlow backend.

Thanks!

[1] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb

djarpin commented 6 years ago

Thanks @ThisIsRick. Have you tried using SageMaker's pre-built TensorFlow container for your task? There's an example notebook here which shows how to use TensorBoard with it. Writing checkpoints to S3 and running TensorBoard locally involve some intricacies that may make this harder to implement in your own container. Thanks.

ThisIsRick commented 6 years ago

Thanks @djarpin. I haven't tried SageMaker's pre-built TensorFlow container. My understanding is that the model script has to follow a specific pattern in order to use the pre-built TensorFlow container, right? Our model script doesn't follow it; it is provided by an applied scientist.

We're also considering continuously syncing checkpoints to S3 from inside the container, and running a separate local thread that syncs those checkpoints down from S3. But our training job is launched with the AWS command line from a local desktop; we don't use a notebook instance on SageMaker. That makes the part that syncs checkpoints from S3 a bit more complicated.
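The container-side half of this idea could look roughly like the sketch below: a background thread that periodically uploads any new or changed files in the TensorBoard log directory to S3. This is only an illustration under assumptions: the function names, the bucket, and the prefix are hypothetical, and `s3_client` is assumed to be a boto3 S3 client (`boto3.client("s3")`), of which only `upload_file` is used.

```python
import os
import threading


def upload_changed_files(s3_client, log_dir, bucket, prefix, uploaded):
    """Upload files under log_dir whose size or mtime changed since the last call.

    `uploaded` maps local path -> (size, mtime_ns) of the last uploaded version.
    Returns the list of S3 keys uploaded during this call.
    """
    sent = []
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            stat = os.stat(path)
            signature = (stat.st_size, stat.st_mtime_ns)
            if uploaded.get(path) == signature:
                continue  # unchanged since the last upload
            relative = os.path.relpath(path, log_dir).replace(os.sep, "/")
            key = "/".join(part for part in (prefix.strip("/"), relative) if part)
            s3_client.upload_file(Filename=path, Bucket=bucket, Key=key)
            uploaded[path] = signature
            sent.append(key)
    return sent


def start_background_upload(s3_client, log_dir, bucket, prefix, interval_seconds=30.0):
    """Start a daemon thread that uploads changed log files every interval.

    Returns (stop_event, thread). Set the event to stop the loop; one final
    upload pass runs before the thread exits.
    """
    stop_event = threading.Event()
    uploaded = {}

    def loop():
        # Event.wait returns False on timeout and True once the event is set.
        while not stop_event.wait(interval_seconds):
            upload_changed_files(s3_client, log_dir, bucket, prefix, uploaded)
        upload_changed_files(s3_client, log_dir, bucket, prefix, uploaded)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop_event, thread
```

In a training script, one would start the thread before training begins, point the Keras TensorBoard callback at the same `log_dir`, and set the stop event (then join the thread) once training finishes so the last events are flushed to S3.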

winstonaws commented 6 years ago

@ThisIsRick

The approach you described is the right one. You need your code inside the container to save checkpoints to S3, and you need to periodically sync your local Tensorboard log directory with your S3 checkpoints.
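For the local side, a minimal sketch of the periodic sync might look like this. It is not the SDK's implementation; it assumes `s3_client` is a boto3 S3 client (`boto3.client("s3")`) and uses only its `list_objects_v2` paginator and `download_file`. The function names, bucket, and prefix are hypothetical.

```python
import os
import time


def sync_logs_once(s3_client, bucket, prefix, local_dir, etags):
    """Download objects under `prefix` whose ETag differs from the last one seen.

    `etags` maps S3 key -> ETag of the version already downloaded.
    Returns the list of local file paths written during this call.
    """
    downloaded = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            relative = key[len(prefix):].lstrip("/")
            if not relative or key.endswith("/"):
                continue  # skip folder placeholders and the prefix itself
            if etags.get(key) == obj["ETag"]:
                continue  # already have this version locally
            target = os.path.join(local_dir, *relative.split("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3_client.download_file(Bucket=bucket, Key=key, Filename=target)
            etags[key] = obj["ETag"]
            downloaded.append(target)
    return downloaded


def sync_logs_forever(s3_client, bucket, prefix, local_dir, interval_seconds=30.0):
    """Poll S3 at a fixed interval; stop with Ctrl+C."""
    etags = {}
    while True:
        sync_logs_once(s3_client, bucket, prefix, local_dir, etags)
        time.sleep(interval_seconds)
```

With the loop running in one terminal, `tensorboard --logdir <local_dir>` in another should pick up the event files as they arrive. Because this only needs AWS credentials on the desktop, it does not depend on a SageMaker notebook instance.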

Here is our implementation in the SageMaker Python SDK: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L29

Are there any specific questions you have about this approach?

djarpin commented 6 years ago

Closing this issue for now. Feel free to re-open it if you run into more problems with this. Thanks.

elgalu commented 5 years ago

Hi @winstonaws, the link you posted points to the master branch, so the line number no longer matches. Could you use a commit ID instead?

ChoiByungWook commented 5 years ago

@elgalu, I believe @winstonaws was pointing to https://github.com/aws/sagemaker-python-sdk/blob/8a3dea24f04a81b06df35a1c7aa262f6a1a02bb5/src/sagemaker/tensorflow/estimator.py#L29

The most up-to-date link as of now would be: https://github.com/aws/sagemaker-python-sdk/blob/cecea123d4933baa8998afd138fee3eaf28a8e49/src/sagemaker/tensorflow/estimator.py#L46

If either of those links goes out of date, he is referring to the TensorBoard class in estimator.py within src/sagemaker/tensorflow.

elgalu commented 3 years ago

from sagemaker.debugger import TensorBoardOutputConfig can also be useful; see https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html#capture-real-time-tensorboard-data-from-the-debugging-hook