aws-samples / aws-parallelcluster-monitoring


Grafana Dashboard for AWS ParallelCluster

This is a sample solution based on Grafana for monitoring the various components of an HPC cluster built with AWS ParallelCluster. It includes six dashboards that can be used as they are or customized to your needs.

Quickstart

Create a cluster using AWS ParallelCluster and include the following configuration:

PC 3.X

Update your cluster's config by adding the following snippet to the HeadNode and Scheduling sections:

CustomActions:
  OnNodeConfigured:
    Script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
    Args:
      - v0.9
Iam:
  AdditionalIamPolicies:
    - Policy: arn:aws:iam::aws:policy/CloudWatchFullAccess
    - Policy: arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess
    - Policy: arn:aws:iam::aws:policy/AmazonSSMFullAccess
    - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
Tags:
  - Key: 'Grafana'
    Value: 'true'

See the complete example config: pcluster.yaml.
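Before applying the snippet to your cluster config, it can help to sanity-check that it is valid YAML and that the script version argument and tag value are what you expect. A minimal sketch, assuming PyYAML is installed; the snippet is abbreviated (policies omitted) for illustration only:

```python
import yaml  # PyYAML, assumed installed (pip install pyyaml)

# Abbreviated version of the CustomActions snippet from above.
snippet = """
CustomActions:
  OnNodeConfigured:
    Script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
    Args:
      - v0.9
Tags:
  - Key: 'Grafana'
    Value: 'true'
"""

cfg = yaml.safe_load(snippet)
# The first OnNodeConfigured argument selects the post-install script version.
print(cfg["CustomActions"]["OnNodeConfigured"]["Args"][0])  # v0.9
# The Grafana tag value must stay the quoted string 'true',
# not an unquoted YAML boolean.
print(type(cfg["Tags"][0]["Value"]).__name__)  # str
```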

AWS ParallelCluster

AWS ParallelCluster is an AWS-supported, open-source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters in the AWS Cloud. It automatically sets up the required compute resources and a shared filesystem, and it offers a variety of batch schedulers such as AWS Batch, SGE, Torque, and Slurm.

Solution components

This project is built with the following components:

Note: while almost all components are under the Apache 2.0 license, Prometheus-Slurm-Exporter is licensed under GPLv3. You need to be aware of this and accept the license terms before proceeding to install this component.

Example Dashboards

Cluster Overview

ParallelCluster

HeadNode Dashboard

Head Node

ComputeNodes Dashboard

Compute Node List

Logs

Logs

Cluster Cost

Costs

Quickstart

  1. Create a Security Group that allows you to access the HeadNode on ports 80 and 443. The following example opens the security group to 0.0.0.0/0; however, we highly advise restricting it further, for example to your own IP or CIDR range. More information on how to create your security groups can be found here.
read -p "Please enter the vpc id of your cluster: " vpc_id
echo -e "creating a security group with $vpc_id..."
security_group=$(aws ec2 create-security-group --group-name grafana-sg --description "Open HTTP/HTTPS ports" --vpc-id ${vpc_id} --output text)
aws ec2 authorize-security-group-ingress --group-id ${security_group} --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id ${security_group} --protocol tcp --port 80 --cidr 0.0.0.0/0
  1. Create a cluster with the post-install script post-install.sh, the Security Group you created above as an AdditionalSecurityGroup on the HeadNode, and a few additional IAM policies. You can find a complete AWS ParallelCluster template here. Please note that, at the moment, the installation script has only been tested on Amazon Linux 2.
CustomActions:
  OnNodeConfigured:
    Script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
    Args:
      - v0.9
Iam:
  AdditionalIamPolicies:
    - Policy: arn:aws:iam::aws:policy/CloudWatchFullAccess
    - Policy: arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess
    - Policy: arn:aws:iam::aws:policy/AmazonSSMFullAccess
    - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
Tags:
  - Key: 'Grafana'
    Value: 'true'
  1. Connect to https://headnode_public_ip or http://headnode_public_ip (all HTTP connections are automatically redirected to HTTPS) and authenticate with the default Grafana password. A landing page will be presented to you with links to the Prometheus database service and the Grafana dashboards.

Login Screen
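Grafana may take a short while to come up after the post-install script finishes. The sketch below polls the endpoint until it answers; the URL is a placeholder, and certificate verification is disabled only because the stack serves a self-signed certificate:

```python
import ssl
import urllib.request
import urllib.error

def endpoint_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP status (even a redirect or 4xx)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # stack uses a self-signed certificate
    try:
        urllib.request.urlopen(url, timeout=timeout, context=ctx)
        return True
    except urllib.error.HTTPError:
        return True   # server responded, just with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused or timed out

# Example (placeholder IP): endpoint_up("https://1.2.3.4") becomes True
# once the Grafana landing page is reachable.
```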

Note: the compute nodes continuously push metrics to the HeadNode, which increases its network traffic. If you expect to run a large-scale cluster (hundreds of instances), we recommend using a slightly larger instance type for the head node than you had originally planned.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.