Closed: humanzz closed this issue 5 years ago
Hi @humanzz thanks for the report! We will look into this.
Hi @iquintero, I'm seeing some progress as evidenced by your #292. I saw in your change that the estimator now warns that this option is deprecated and that metrics are emitted by default for training jobs. How about for endpoints? I have a running endpoint and I can see it's not emitting anything, and monitoring the endpoint was the main reason I tried to enable the metrics. Also, related to that, I couldn't find any documentation about SAGEMAKER_CONTAINER_LOG_LEVEL, which values are valid, or whether it affects Instance Metrics at all.
Hi @humanzz
I just checked one of my endpoints and it does post metrics. On the AWS Console, under View Instance Metrics, it does seem like there are no metrics:
However, if you click under /aws/sagemaker/Endpoints and keep the same endpoint name filter, you will see something similar to:
We can see there are 5 metrics for this endpoint; clicking on it will show all the actual instance metrics:
I will check with our console team to see why the link from the SageMaker console is not correct. However this should hopefully unblock you.
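If it helps, the console workaround above can also be scripted instead of clicked through. This is a sketch, not something from the SDK: it assumes boto3 with credentials and a region configured, and it assumes the per-endpoint dimension in the /aws/sagemaker/Endpoints namespace is named EndpointName.

```python
def instance_metrics_query(endpoint_name):
    """Build the CloudWatch list_metrics parameters for the
    /aws/sagemaker/Endpoints namespace mentioned above.
    The EndpointName dimension name is an assumption."""
    return {
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
    }

# With boto3 installed and AWS credentials configured, something like:
# import boto3
# cw = boto3.client("cloudwatch")
# resp = cw.list_metrics(**instance_metrics_query("my-endpoint"))
# print([m["MetricName"] for m in resp["Metrics"]])
```

This lists the metric names for the endpoint the same way the filtered console view does.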
Regarding SAGEMAKER_CONTAINER_LOG_LEVEL, the valid values are whatever the python logger accepts: https://docs.python.org/2/library/logging.html#logging-levels
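For quick reference, those are the standard Python logging levels; this minimal snippet prints their names and numeric values (the assumption being that the container hands SAGEMAKER_CONTAINER_LOG_LEVEL straight to the logging module, so either form should line up with what it accepts):

```python
import logging

# Standard Python logging levels and their numeric values.
for name in ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"):
    print(name, getattr(logging, name))
# CRITICAL 50, ERROR 40, WARNING 30, INFO 20, DEBUG 10, NOTSET 0
```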
Hope this helps!
Thanks @iquintero. I confirm I found the metrics for my endpoints as per your instructions and this definitely unblocks me.
I hope the console team fixes this soon, as this is tbh quite confusing.
Thanks!
Seems like the change has been released for a while now (and I can see the instance metrics in the AWS console by simply clicking "View instance metrics"), so closing this issue.
System Information
Describe the problem
The MXNet container does not emit instance metrics by default. When trying to enable them by setting enable_cloudwatch_metrics=True, the training job fails with the exception pasted below.
Minimal repro / logs
Command to train (note it uses some of my own Python code, as it's a custom MXNet model)
Training job CloudWatch log showing the error
Notes
enable_cloudwatch_metrics is always set to False. This might be because the containers are known not to support it. Otherwise, testing cases with enable_cloudwatch_metrics=True sounds like something that should be added.