Closed ThisIsRick closed 6 years ago
Thanks @ThisIsRick . Have you tried using SageMaker's pre-built TensorFlow container for your task? There's an example notebook here which shows how to use TensorBoard with it. There are some intricacies with writing checkpoints to S3 and running TensorBoard locally that may make this more difficult to implement in your own container. Thanks.
Thanks @djarpin. I haven't tried SageMaker's pre-built TensorFlow container. My understanding is that the model script has to follow a specific pattern in order to use the pre-built TensorFlow container, right? Ours doesn't; it is provided by an applied scientist.
We're also considering continuously syncing checkpoints to S3 from inside the container, and having a separate local thread that syncs the checkpoints down from S3. But our training job is launched with the AWS command line from a local desktop; we don't use a notebook instance on SageMaker. That makes the part that syncs checkpoints down from S3 a bit more complicated.
@ThisIsRick
The approach you described is the right one. The code inside your container needs to save checkpoints to S3, and you need to periodically sync your local TensorBoard log directory with those S3 checkpoints.
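As a rough illustration of that periodic sync, here is a minimal sketch that shells out to `aws s3 sync` on a timer. It assumes the AWS CLI is installed and has credentials; the function names, the bucket, and the directories are hypothetical and are not taken from the SDK implementation.

```python
import subprocess
import time
from typing import Callable, List, Optional


def build_sync_command(source: str, destination: str) -> List[str]:
    """Return the AWS CLI invocation that copies new or changed files from source to destination."""
    return ["aws", "s3", "sync", source, destination]


def sync_periodically(
    source: str,
    destination: str,
    interval_seconds: float = 60.0,
    max_rounds: Optional[int] = None,
    runner: Callable[[List[str]], object] = subprocess.check_call,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `aws s3 sync` repeatedly and return the number of completed rounds.

    With max_rounds=None the loop runs until the process is stopped.
    runner and sleep are injectable so the loop can be exercised without AWS.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        runner(build_sync_command(source, destination))
        rounds += 1
        # Only wait if another round is going to follow.
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
    return rounds


# Hypothetical usage on the local desktop (bucket and paths are placeholders):
#   sync_periodically("s3://my-bucket/my-job/logs", "./tb_logs")
# and in a second terminal:
#   tensorboard --logdir ./tb_logs
```

The same helper could run inside the container in the other direction (local log directory as the source, the S3 prefix as the destination), for example on a daemon thread started before training begins.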
Here is our implementation in the SageMaker Python SDK: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L29
Are there any specific questions you have about this approach?
Closing this issue for now. Feel free to re-open it if you run into more problems with this. Thanks.
Hi @winstonaws, the link you posted points to the master branch, so the line number no longer matches. Could you use a commit ID instead?
@elgalu, I believe @winstonaws was pointing to https://github.com/aws/sagemaker-python-sdk/blob/8a3dea24f04a81b06df35a1c7aa262f6a1a02bb5/src/sagemaker/tensorflow/estimator.py#L29
The most up-to-date link as of now would be: https://github.com/aws/sagemaker-python-sdk/blob/cecea123d4933baa8998afd138fee3eaf28a8e49/src/sagemaker/tensorflow/estimator.py#L46
If any of those links go out of date, he is referring to the TensorBoard class in estimator.py within src/sagemaker/tensorflow.
from sagemaker.debugger import TensorBoardOutputConfig
can also be useful; see https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html#capture-real-time-tensorboard-data-from-the-debugging-hook
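For jobs launched from the AWS CLI rather than the Python SDK, the equivalent setting is, to my understanding, the TensorBoardOutputConfig field of the CreateTrainingJob request, with S3OutputPath and LocalPath keys. The sketch below only builds that fragment as JSON; the helper name and the bucket are hypothetical, and the default local path is the one I believe SageMaker uses, so check it against the API reference before relying on it.

```python
import json
from typing import Dict


def tensorboard_output_config(
    s3_output_path: str,
    local_path: str = "/opt/ml/output/tensorboard",  # assumed default; verify in the API reference
) -> Dict[str, Dict[str, str]]:
    """Build the TensorBoardOutputConfig fragment of a CreateTrainingJob request."""
    if not s3_output_path.startswith("s3://"):
        raise ValueError("s3_output_path must be an s3:// URI")
    return {
        "TensorBoardOutputConfig": {
            "S3OutputPath": s3_output_path,
            "LocalPath": local_path,
        }
    }


# Hypothetical bucket and prefix; replace with your own.
fragment = tensorboard_output_config("s3://my-bucket/my-job/tensorboard")
print(json.dumps(fragment, indent=2))
```

The training code in the container would then write its TensorBoard event files under the local path, and the fragment would be merged into the rest of the job definition passed to the CLI.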
I'm running training on SageMaker with a Docker image that contains my own algorithm container, following the example in [1]. How can I enable TensorBoard to monitor model training performance in real time? My model is built with Keras on a TensorFlow backend.
Thanks!
[1] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb