aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

ParallelCluster 3.10.1 fails to setup accounting for slurm cluster (port 6819 unreachable) #6398

Closed ElDeveloper closed 2 months ago

ElDeveloper commented 2 months ago

Required Info:

Region: us-east-1
Image:
  Os: rhel8
HeadNode:
  InstanceType: t2.large
  Networking:
    SubnetId: subnet-xxxxxx
  Ssh:
    KeyName: personal-login
  Iam:
    S3Access:
      - BucketName: xxxxxx
        EnableWriteAccess: true
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonS3FullAccess
      - Policy: arn:aws:iam::aws:policy/SecretsManagerReadWrite

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: mainq
      ComputeResources:
        - Name: c52xlarge
          Instances:
            - InstanceType: c5.2xlarge
          MinCount: 0
          MaxCount: 32
        - Name: c5xlarge
          Instances:
            - InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 32
        - Name: r5a4xlarge
          Instances:
            - InstanceType: r5a.4xlarge
          MinCount: 0
          MaxCount: 2
      Networking:
        SubnetIds:
          - subnet-xxxx
      Iam:
        S3Access:
          - BucketName: xxxxxxxxx
            EnableWriteAccess: true

SharedStorage:
  - MountDir: "/scratch"
    Name: scratch
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_1

{
  "creationTime": "2024-08-15T23:54:47.078Z",
  "headNode": {
    "launchTime": "2024-08-16T00:04:13.000Z",
    "instanceId": "i-xxxxx",
    "publicIpAddress": "xxxxx",
    "instanceType": "t2.large",
    "state": "running",
    "privateIpAddress": "xxxxx"
  },
  "version": "3.10.1",
  "clusterConfiguration": {
    "url": "xxxxxxx"
  },
  "tags": [
    {
      "value": "3.10.1",
      "key": "parallelcluster:version"
    },
    {
      "value": "clstr-a39",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "clstr-a39",
  "computeFleetStatus": "STOPPING",
  "cloudformationStackArn": "xxxxxxx",
  "lastUpdatedTime": "2024-08-15T23:54:47.078Z",
  "region": "us-east-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

Bug description and how to reproduce: The configuration file listed above lets me successfully create a cluster. The problem appears when I add the following:

  SlurmSettings:
    Database:
      Uri: xxxxxx.rds.amazonaws.com:3306
      UserName: admin
      PasswordSecretArn: arn:aws:secretsmanager:xxxxxxxxx

With that block in place, the commands below fail to set up accounting:

pcluster update-compute-fleet --cluster-name clstr-a39 --status STOP_REQUESTED
pcluster update-cluster -n clstr-a39 -c auto.yaml

From reviewing the logs, the errors that show up are:

sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:ip-10-0-0-25:6819: Connection refused
sacctmgr: error: Sending PersistInit msg: Connection refused
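
Port 6819 is the default slurmdbd port, so "Connection refused" here suggests that nothing is listening on it, for example because slurmdbd never started. A minimal sketch for checking this from the head node, assuming plain TCP reachability is the question; the helper name `port_open` is hypothetical and not part of ParallelCluster or Slurm:

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling `port_open("ip-10-0-0-25", 6819)` with the host name from the error above would return False while the daemon is down. That only tells you the port is closed, not why, so the slurmdbd log is still the next place to look.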

Please be sure to attach the following logs: cfn-init.log chef-client.log completed.log

ElDeveloper commented 2 months ago

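The problem was that the password contained a # character, and Slurm was not using the full string. Instead of pass#word, it used only pass, so it failed to connect to the database. This is consistent with # starting a comment in Slurm configuration files, although I did not confirm exactly where the truncation happens. FWIW, the password was autogenerated by Secrets Manager or RDS.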
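
To illustrate the failure mode, here is a rough sketch that assumes the truncation comes from `#` being read as a comment marker. `strip_comment` only mimics that behavior and is not Slurm's actual parser; `generate_password` is a hypothetical helper for producing a replacement password without the character.

```python
import secrets
import string


def strip_comment(line: str) -> str:
    """Mimic a config reader that drops everything from '#' to end of line."""
    return line.split("#", 1)[0].rstrip()


def generate_password(length: int = 32, exclude: str = "#") -> str:
    """Build a random password that avoids every character in `exclude`."""
    candidates = string.ascii_letters + string.digits + string.punctuation
    alphabet = [c for c in candidates if c not in exclude]
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

With this, `strip_comment("StoragePass=pass#word")` yields `StoragePass=pass`, matching the symptom described above. If the password is generated through Secrets Manager, its GetRandomPassword operation accepts an `ExcludeCharacters` parameter that can serve the same purpose. RDS places its own restrictions on password characters, so the exclusion list may need to be longer than just `#`.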