[BUG] Error thrown when trying to enable instance metrics for MXNet

humanzz commented 6 years ago

System Information

Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): MXNet
Framework Version: 1.1
Python Version: 3
CPU or GPU: CPU
Python SDK Version: 1.5.1
Are you using a custom image: No

Describe the problem

The MXNet container does not emit instance metrics by default. When I try to enable them by setting enable_cloudwatch_metrics=True, the training job fails with the exception pasted below.

Minimal repro / logs

Command to train (note that it uses some of my own Python code, as it's a custom MXNet model)

    from sagemaker.mxnet import MXNet

    # role, sagemaker_session, current_env and get_config are defined elsewhere (not shown)
    m = MXNet("kamel_estimator.py",
              source_dir='src',
              role=role,
              train_instance_count=1,
              train_instance_type="ml.m4.xlarge",
              sagemaker_session=sagemaker_session,
              py_version="py3",
              base_job_name="kamel-test-20180705",
              enable_cloudwatch_metrics=True,
              output_path=get_config(current_env, 'artifacts_s3_bucket'),
              code_location=get_config(current_env, 'artifacts_s3_bucket'),
              hyperparameters={'some_param': 1})
    m.fit('s3://kamel-data/training/data')

Training job CloudWatch log showing the error

    2018-07-05 11:30:31,569 INFO - root - running container entrypoint
    2018-07-05 11:30:31,570 INFO - root - starting train task
    2018-07-05 11:30:31,573 INFO - container_support.training - Training starting
    2018-07-05 11:30:31,575 INFO - container_support.environment - starting metrics service
    2018-07-05 11:30:31,578 ERROR - container_support.training - uncaught exception during training: [Errno 2] No such file or directory: 'telegraf'
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 32, in start
        env.start_metrics_if_enabled()
      File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 124, in start_metrics_if_enabled
        subprocess.Popen(['telegraf', '--config', telegraf_conf])
      File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
        restore_signals, start_new_session)
      File "/usr/lib/python3.5/subprocess.py", line 1551, in _execute_child
        raise child_exception_type(errno_num, err_msg)
    FileNotFoundError: [Errno 2] No such file or directory: 'telegraf'
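
The traceback shows subprocess.Popen raising FileNotFoundError because no executable named telegraf is on the PATH inside the container. As a rough sketch of how such a launch could be guarded, the snippet below looks the binary up with shutil.which before starting it. The helper name start_metrics_agent, its default config path, and the decision to skip rather than fail are all assumptions made for illustration; this is not the container's actual code.

```python
import shutil
import subprocess


def start_metrics_agent(binary="telegraf", config_path="/tmp/telegraf.conf"):
    """Start the metrics agent if its binary is on PATH; otherwise return None.

    Hypothetical helper: the name, the default config path, and the
    skip-on-missing behaviour are assumptions, not the real implementation.
    """
    executable = shutil.which(binary)
    if executable is None:
        # Calling Popen with a missing binary raises FileNotFoundError,
        # which is the failure shown in the log above.
        return None
    return subprocess.Popen([executable, "--config", config_path])
```

With a guard like this, a missing agent would disable metrics instead of aborting the training job. Whether silently skipping is the right behaviour is a design choice the thread does not settle.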

Notes

I suspect the issue is not actually with the Python SDK but with the container itself, which appears to be missing the telegraf executable.
I am opening the issue here as this is the entry point that exposes the problem.
Throughout the SDK, enable_cloudwatch_metrics is always set to False. This might be because the containers are known not to support it. Otherwise, tests for cases with enable_cloudwatch_metrics=True sound like something that should be added.

iquintero commented 6 years ago

Hi @humanzz thanks for the report! We will look into this.

humanzz commented 6 years ago

Hi @iquintero, I'm seeing some progress as evidenced by your #292. I saw in your change that the estimator now warns that this option is deprecated and that metrics are emitted by default for training jobs. How about endpoints? I already have a running endpoint and I can see it's not emitting anything. Monitoring the endpoint was the main reason I tried to enable the metrics. Also, related to that, I failed to find any documentation about SAGEMAKER_CONTAINER_LOG_LEVEL: which values are valid, and whether it affects instance metrics at all.

iquintero commented 6 years ago

Hi @humanzz

I just checked one of my endpoints and it does post metrics. On the AWS Console, under View Instance Metrics, it does seem like there are no metrics.

However, if you click /aws/sagemaker/Endpoints and keep the same endpoint name filter, you will see the metrics for that endpoint.

We can see there are 5 metrics for this endpoint; clicking on it shows all the actual instance metrics.

I will check with our console team to see why the link from the SageMaker console is not correct. However this should hopefully unblock you.
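
The same lookup can be done without the console. The sketch below lists the metric names CloudWatch holds in the /aws/sagemaker/Endpoints namespace for one endpoint. It assumes a boto3 CloudWatch client is passed in and that the metrics carry an EndpointName dimension; the function name list_endpoint_metric_names is made up for this example.

```python
def list_endpoint_metric_names(cloudwatch, endpoint_name):
    """Return the sorted metric names published for one endpoint.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    The EndpointName dimension is an assumption about how the metrics are tagged.
    """
    kwargs = {
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
    }
    names = set()
    while True:
        page = cloudwatch.list_metrics(**kwargs)
        names.update(metric["MetricName"] for metric in page.get("Metrics", []))
        token = page.get("NextToken")
        if not token:
            return sorted(names)
        # More results remain; request the next page.
        kwargs["NextToken"] = token


# Example usage (requires boto3 and AWS credentials; endpoint name is a placeholder):
# import boto3
# print(list_endpoint_metric_names(boto3.client("cloudwatch"), "my-endpoint"))
```

An empty result would suggest that the endpoint is not publishing instance metrics at all, rather than the console link pointing at the wrong place.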

Regarding SAGEMAKER_CONTAINER_LOG_LEVEL, the valid values are whatever the Python logging module accepts: https://docs.python.org/2/library/logging.html#logging-levels
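
For reference, the standard levels in Python's logging module are DEBUG (10), INFO (20), WARNING (30), ERROR (40) and CRITICAL (50). The thread does not say whether the variable should hold the level name or its number, so the sketch below accepts either. The parse_log_level helper is hypothetical and only shows how the two forms map onto each other.

```python
import logging


def parse_log_level(value):
    """Turn a level name ("DEBUG") or number ("10") into a logging level int.

    Hypothetical helper for illustration; not part of the SDK or the container.
    """
    text = str(value).strip()
    if text.isdigit():
        return int(text)
    level = logging.getLevelName(text.upper())
    if not isinstance(level, int):
        # getLevelName returns a string such as "Level FOO" for unknown names.
        raise ValueError("unknown log level: %r" % (value,))
    return level
```

For example, parse_log_level("debug") and parse_log_level("10") both evaluate to logging.DEBUG.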

Hope this helps!

humanzz commented 6 years ago

Thanks @iquintero. I confirm I found the metrics for my endpoints as per your instructions and this definitely unblocks me.

I hope the console team fixes this soon as this is tbh quite confusing.

Thanks!

laurenyu commented 5 years ago

It seems the change has been released for a while now (and I can see the instance metrics in the AWS console by simply clicking "View instance metrics"), so I'm closing this issue.
