aws-samples / sagemaker-studio-lifecycle-config-examples

Error when using CW Agent on KernelApp #9

Open cabral1888 opened 2 years ago

cabral1888 commented 2 years ago

Hi all,

I am new to SageMaker Studio and was wondering whether there is a way to monitor Studio usage: for example, how many machines are in use and how much RAM and CPU the users are consuming. I looked at the examples in the notebook instance lifecycle configuration repo (https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples) and found a very interesting lifecycle configuration: publish-instance-metrics.

I tried to reproduce this notebook instance lifecycle configuration as a Studio lifecycle configuration, but without success. Here is my Studio lifecycle configuration:

#!/bin/bash

set -e

# OVERVIEW
# This script publishes the system-level metrics from the Notebook instance to Cloudwatch.
#
# Note that this script will fail if either condition is not met
#   1. Ensure the Notebook Instance has internet connectivity to fetch the example config
#   2. Ensure the Notebook Instance execution role permissions to cloudwatch:PutMetricData to publish the system-level metrics
#
# https://aws.amazon.com/cloudwatch/pricing/
apt-get update
apt-get -y install jq

# PARAMETERS
NOTEBOOK_INSTANCE_NAME=$(jq '.ResourceName' /opt/ml/metadata/resource-metadata.json --raw-output)

echo "Fetching the CloudWatch agent configuration file."
wget https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/master/scripts/publish-instance-metrics/amazon-cloudwatch-agent.json

sed -i -- "s/MyNotebookInstance/$NOTEBOOK_INSTANCE_NAME/g" amazon-cloudwatch-agent.json

echo "Starting the CloudWatch agent on the Notebook Instance."
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file://$(pwd)/amazon-cloudwatch-agent.json -s

To reproduce the problem and understand what happened, I opened a terminal tab inside SageMaker Studio and ran the commands one by one. The last command gave me the following output:

/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl: 469: /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl: systemctl: not found
unknown init system

I don't know whether I'm missing something or whether this isn't supported yet by SageMaker Studio. Can you please help me with this issue?

P.S.: I'm using a Python 3 kernel with the Data Science Docker image.

cabral1888 commented 2 years ago

Anyone?

leezero-carbon commented 9 months ago

@cabral1888 I struggled with this recently, and it's worth understanding the difference between the SageMaker Notebook instance documentation and the SageMaker Studio documentation.

The AWS docs are not always clear about which of the two they are describing.

Anyway, the code you mention above is meant to run on a Notebook instance, not in the Studio KernelGateway app that runs your notebook. The KernelGateway container has no init system (which is why the agent control script reports "systemctl: not found" and "unknown init system"), and you shouldn't try to add one, as that kind of breaks the idea of running Docker.
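To make the "no init system" point concrete, here is a minimal sketch of a check you could run in a Studio terminal before relying on anything that needs systemd or upstart. The function name detect_init_system is made up for this example, and the checks are a rough heuristic, not what amazon-cloudwatch-agent-ctl does internally.

```shell
#!/bin/bash

# Hypothetical helper: report which init system, if any, is usable here.
# Heuristic only: systemd needs both the systemctl binary and a running
# systemd (signalled by /run/systemd/system); upstart is detected via initctl.
detect_init_system() {
    if command -v systemctl >/dev/null 2>&1 && [ -d /run/systemd/system ]; then
        echo "systemd"
    elif command -v initctl >/dev/null 2>&1; then
        echo "upstart"
    else
        echo "none"
    fi
}

INIT_SYSTEM=$(detect_init_system)
if [ "$INIT_SYSTEM" = "none" ]; then
    echo "No init system found: start long-running tools as background processes instead."
else
    echo "Init system detected: $INIT_SYSTEM"
fi
```

In the terminal session described in the question, where systemctl is not found, this check would be expected to take the "none" branch.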

Instead, look into running a metric-gathering process that doesn't rely on an init system and can be spawned into the background, e.g.:

nohup custom-tool --config /opt/custom-tool/config.conf > /opt/custom-tool/output.log 2>&1 &
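As a rough illustration of that pattern, the sketch below replaces the placeholder custom-tool with a tiny collector written in bash. Everything in it is an assumption for the example: the function names, the log path /tmp/studio-metrics.log, and the choice of reading /proc/meminfo. Note that inside a container, /proc/meminfo may report the host's memory and not the container's limit.

```shell
#!/bin/bash

# Hypothetical collector: print the used-memory percentage from /proc/meminfo.
mem_used_percent() {
    awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
         END { if (total > 0) printf "%.1f\n", (total - avail) * 100 / total }' /proc/meminfo
}

# Emit one timestamped sample per interval (seconds), forever.
collect_loop() {
    local interval="${1:-60}"
    while true; do
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) mem_used_percent=$(mem_used_percent)"
        sleep "$interval"
    done
}

# Only spawn the background process when called with "start", so the
# functions can be sourced or tested without side effects.
if [ "${1:-}" = "start" ]; then
    LOG_FILE="${LOG_FILE:-/tmp/studio-metrics.log}"
    # declare -f serialises the functions so the detached shell can run them.
    nohup bash -c "$(declare -f mem_used_percent collect_loop); collect_loop 60" \
        >> "$LOG_FILE" 2>&1 &
    echo "collector started with PID $!"
fi
```

Saved as, say, collect.sh (a made-up name), it would be launched with "bash collect.sh start". The samples only go to a log file here; publishing them to CloudWatch would be a separate step.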

In my case, I used telegraf with its CloudWatch output plugin.
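For anyone who wants to try the telegraf route, here is a minimal sketch of what that could look like. It is not the exact setup used above. It assumes telegraf is already installed in the image and that the execution role allows cloudwatch:PutMetricData. The config path, namespace, default region, and function name are placeholders chosen for the example.

```shell
#!/bin/bash

# Placeholder values: adjust for your own account and naming scheme.
REGION="${AWS_REGION:-us-east-1}"
NAMESPACE="SageMakerStudio/KernelGateway"

# Hypothetical helper: write a small telegraf config that samples CPU and
# memory once a minute and sends them to CloudWatch.
write_telegraf_config() {
    local path="$1"
    cat > "$path" <<EOF
[agent]
  interval = "60s"

[[inputs.cpu]]
  percpu = false
  totalcpu = true

[[inputs.mem]]

[[outputs.cloudwatch]]
  region = "${REGION}"
  namespace = "${NAMESPACE}"
EOF
}

# Only start telegraf when called with "start" and the binary is present.
if [ "${1:-}" = "start" ]; then
    CONFIG_FILE="${CONFIG_FILE:-/tmp/telegraf.conf}"
    write_telegraf_config "$CONFIG_FILE"
    if command -v telegraf >/dev/null 2>&1; then
        nohup telegraf --config "$CONFIG_FILE" > /tmp/telegraf.log 2>&1 &
        echo "telegraf started with PID $!"
    else
        echo "telegraf is not installed in this image" >&2
        exit 1
    fi
fi
```

Because telegraf is started with nohup and no init system, it stops when the container does, so it would need to be started again from the lifecycle configuration each time the app starts.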