aws-samples / aws-parallelcluster-monitoring

Monitoring Dashboard for AWS ParallelCluster
MIT No Attribution

Validation errors #6

Closed afernandezody closed 3 years ago

afernandezody commented 3 years ago

Hello, maybe there is something wrong in my configuration:

[global]
update_check = true
sanity_check = true
cluster_template = w1cluster

[aws]
aws_region_name = us-east-1
aws_access_key_id = ***
aws_secret_access_key = ***

[cluster w1cluster]
vpc_settings = odyvpc
placement_group = DYNAMIC
placement = compute
key_name = llave_i3
master_instance_type = t3.micro
compute_instance_type = c5.large
cluster_type = spot
disable_hyperthreading = true
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
scheduler = slurm
base_os = alinux2
post_install = https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
tags = {“Grafana” : “true”}

[vpc odyvpc]
master_subnet_id = ***
vpc_id = ***

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

because it fails validation as soon as it starts creating the CloudWatchLogsSubstack:

$pcluster create w1cluster
Beginning cluster creation for cluster: w1cluster
Creating stack named: parallelcluster-w1cluster
Status: parallelcluster-w1cluster - ROLLBACK_IN_PROGRESS
Cluster creation failed.  Failed events:
  - AWS::EC2::SecurityGroup MasterSecurityGroup Resource creation cancelled
  - AWS::CloudFormation::Stack CloudWatchLogsSubstack Resource creation cancelled
  - AWS::CloudFormation::Stack EBSCfnStack Resource creation cancelled
  - AWS::EC2::EIP MasterEIP Resource creation cancelled
  - AWS::IAM::Role RootRole 2 validation errors detected: Value '?true?' at 'tags.1.member.value' failed to satisfy constraint: Member must satisfy regular expression pattern: [\p{L}\p{Z}\p{N}_.:/=+\-@]*; Value '?Grafana?' at 'tags.1.member.key' failed to satisfy constraint: Member must satisfy regular expression pattern: [\p{L}\p{Z}\p{N}_.:/=+\-@]+ (Service: AmazonIdentityManagement; Status Code: 400; Error Code: ValidationError; Request ID: 2f882ed6-38fd-4736-9dbb-42a78abc7fe1; Proxy: null)
  - AWS::IAM::Role CleanupResourcesFunctionExecutionRole 2 validation errors detected: Value '?true?' at 'tags.1.member.value' failed to satisfy constraint: Member must satisfy regular expression pattern: [\p{L}\p{Z}\p{N}_.:/=+\-@]*; Value '?Grafana?' at 'tags.1.member.key' failed to satisfy constraint: Member must satisfy regular expression pattern: [\p{L}\p{Z}\p{N}_.:/=+\-@]+ (Service: AmazonIdentityManagement; Status Code: 400; Error Code: ValidationError; Request ID: 4d8a9724-99f5-4d22-bc15-9c29c9f0de5f; Proxy: null)
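
The errors above come from IAM's tag character constraint: tag keys and values may only contain Unicode letters, separators, digits, and the literal characters `_.:/=+-@`. Typographic quotes fall outside that set (they show up as `?` in the error message). A stdlib-only Python sketch that approximates the IAM pattern `[\p{L}\p{Z}\p{N}_.:/=+\-@]` (my illustration, not AWS's actual validator):

```python
import unicodedata

ALLOWED_PUNCT = set("_.:/=+-@")

def satisfies_iam_tag_pattern(s: str) -> bool:
    """Approximate IAM's tag regex [\\p{L}\\p{Z}\\p{N}_.:/=+\\-@]:
    every character must be a Unicode letter (category L*), separator
    (Z*), number (N*), or one of the literal punctuation marks above."""
    return all(
        unicodedata.category(ch)[0] in ("L", "Z", "N") or ch in ALLOWED_PUNCT
        for ch in s
    )

print(satisfies_iam_tag_pattern("Grafana"))              # plain ASCII key
print(satisfies_iam_tag_pattern("\u201cGrafana\u201d"))  # curly-quoted key
```

Curly quotes (U+201C/U+201D) are Unicode punctuation, not letters, separators, or digits, so a tag key that includes them is rejected.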

Thanks.

nicolaven commented 3 years ago

Hi @afernandezody thanks for reaching out.

I believe the quotes you are using in the tags parameter are not the right ones: you have typographic quotes (“) where straight quotes (") are needed. It should be

tags = {"Grafana" : "true"}

Please fix this and let me know if it solves the error.
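
One quick way to spot the offending characters is to scan the config text for typographic quotes, which often sneak in via copy/paste. A small Python sketch (the sample `config` string below is just an illustration):

```python
# Typographic quotes that commonly replace straight quotes in editors:
# U+2018/U+2019 (single) and U+201C/U+201D (double).
SMART_QUOTES = {"\u2018", "\u2019", "\u201c", "\u201d"}

def find_smart_quotes(text: str):
    """Return (line_number, column, char) for every smart quote found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SMART_QUOTES:
                hits.append((lineno, col, ch))
    return hits

config = 'tags = {\u201cGrafana\u201d : \u201ctrue\u201d}'
for lineno, col, ch in find_smart_quotes(config):
    print(f"line {lineno}, col {col}: {ch!r} should be a straight quote")
```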

Thanks

afernandezody commented 3 years ago

That doesn't make any difference.

nicolaven commented 3 years ago

Can you please try to completely remove the tags parameter?

afernandezody commented 3 years ago

Hi @nicolaven, somehow I cannot even log in to the master instance after removing the tag. I had to roll the stack back and couldn't check any logs. In addition to this issue, something else has caught my attention: the configuration uses post_install_args, which is set to https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh. However, there is no 'tarball' subdirectory in the repository, as everything looks uncompressed (or maybe the post-install script takes care of this, but it doesn't look like that to me). Thanks.

nicolaven commented 3 years ago

Hi @afernandezody if you go to this URL https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main with your browser you can see it is actually downloading a tar.gz file. The post-install script is basically downloading that file and untarring it.
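
For illustration, that download-and-extract step amounts to something like the following Python sketch (the real post-install logic is a shell script, and the in-memory tarball here merely stands in for the GitHub archive so the example is self-contained):

```python
import io
import tarfile

# Build a small in-memory tar.gz standing in for the GitHub tarball
# (the real script fetches https://github.com/.../tarball/main instead).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"#!/bin/bash\necho install\n"
    info = tarfile.TarInfo(name="repo-main/install-monitoring.sh")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Untar it, as the post-install script does with the downloaded archive.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
    script = tar.extractfile("repo-main/install-monitoring.sh").read()

print(names)
```

The point is that `tarball/main` is a GitHub download endpoint, not a subdirectory of the repository, which is why nothing named "tarball" appears in the source tree.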

Regarding the tags, I would suggest trying to recreate a new cluster using this configuration file

[global]
update_check = true
sanity_check = true
cluster_template = w1cluster

[aws]
aws_region_name = us-east-1
aws_access_key_id = ***
aws_secret_access_key = ***

[cluster w1cluster]
vpc_settings = odyvpc
placement_group = DYNAMIC
placement = compute
key_name = llave_i3
master_instance_type = t3.micro
compute_instance_type = c5.large
cluster_type = spot
disable_hyperthreading = true
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
scheduler = slurm
base_os = alinux2
post_install = https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
tags = {"Grafana" : "true"}

[vpc odyvpc]
master_subnet_id = ***
vpc_id = ***

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

afernandezody commented 3 years ago

I would not have figured that out in a million years. The cluster launched and I was able to open the dashboard in the browser (although I didn't know the Grafana password). However, my main issue now is that using CentOS 8 as the OS results in the compute instances being created and terminated in an apparently endless loop. (I had read the comment that only alinux2 has been tested.) I went over all the files but didn't find anything outstanding. The only thing that crossed my mind was whether the variable 'cfn_cluster_user' is not being gathered correctly. Any thoughts on why it doesn't work with CentOS 8? Thanks.

nicolaven commented 3 years ago

Yes, I confirm that this monitoring dashboard has only been tested with AL2. I'd suggest having a look at the installation log at /tmp/monitoring-setup.log and trying to figure out what's wrong with CentOS 8. Most likely it is the installation of the components, here: https://github.com/aws-samples/aws-parallelcluster-monitoring/blob/main/parallelcluster-setup/install-monitoring.sh

Feel free to send a PR with the modification needed.

Thanks

nicolaven commented 3 years ago

Any progress? Do you need help?

afernandezody commented 3 years ago

Hi @nicolaven,
It's working for both CentOS 7 & 8. The only thing I haven't tested is p3 (or other GPU) compute instances, but that should be no problem as it's only a minor correction.
Best.

afernandezody commented 3 years ago

Closing as the PR fixes my original problem, but let me know if the PR needs any fixing.