aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0
826 stars 312 forks

Cluster Stopped Appearing in Lists for CLI/UI #6141

Closed mission-coliveros closed 5 months ago

mission-coliveros commented 6 months ago

Bug description and how to reproduce: We've had a production cluster up and running for about 4 months. It is online, scales up nodes, and runs jobs, but it has stopped appearing in the output of list commands, whether run through the ParallelCluster UI or the CLI. Strangely enough, the describe-cluster command still works.

Additional context: The commands we're running are:

pcluster list-clusters -r us-east-1

{
  "nextToken": "CdHr0bQ/U0fy83/BwVnyL9LqOEUrV4yKxUVyQ3+vedhjbjKGcjgb/0Fri+r8MmVE0dhKWa1qFduPGiQTX2KDhN0BWMVXNb9DGBBy92OjS/MzGVj54NrtmMfhQGt5RbfOPXKBdvbqW0to7f+osmQ8rlil+OZmhGLNhbDJ48pwOEM2NB06/rpQldOvGFE8psEMPQgyAmBTVUeFwhessL5oOOVxnrUXB9dPYMirCAG2/flYAyyxB85mtCbntRyihm3iah/SDgtwa7WUgy6KQHGV7wt3K29M09SQ+rWDL/XMOmjkLeSAdyzMi5PE5jHqN/nHaNzfAYGjnRBcIhvuX2d7O/gC3bHWgfhYB4aVwSn0Wf5hhl6cifbtd/1qCA6AM4rIageNFpiXfVJnSzXpUPr5GcOo+qYEbgvhehQfvVOJXD1Zoryvsb+9ytE8rCfNxkr5l1+dKi8KfJ5DDAbBuR3oV96mNGHvlot7CcK0hH/SLL3T/m6ZGfYtzPmr8Aq1FqhOTlNXMIc4hEG3Awww+OBVWjM957reFcdT4Vp0sfUQgmZ8HdK04mJRElGIrovd1iJVQkQtfWsp9VRNMyIpihgtLWCgyVIehuKFahpHfYLE43ue19nwdFNeD+WGKilZtWaTY8q5woGsvMhZF7JnGAwpl1WPNmesO8Hz5ZSFOK4VkJGt6S+kxjKo3LbVfjy+C1WQp1Glg0v2j8PCSC983poUicdNu+C9vrN/gYGGjQ+ra6E=|jjlYHUJmavcc50EvL9gG/w==|1|3616c0329540075297171e15d3a1024b42d63fb8ead272a5efe4914d5e42dafe",
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "CREATE_COMPLETE",
      "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:061785347060:stack/test/b5afda70-b586-11ee-9649-127bd0c973a3",
      "region": "us-east-1",
      "version": "3.7.0",
      "clusterStatus": "CREATE_COMPLETE",
      "scheduler": {
        "type": "slurm"
      }
    }
  ]
}

pcluster describe-cluster -r us-east-1 --cluster-name test

{
  "creationTime": "2024-01-17T22:21:07.571Z",
  "headNode": {
    "launchTime": "2024-01-17T22:26:07.000Z",
    "instanceId": "i-0d726a63fda0fd34b",
    "instanceType": "t2.micro",
    "state": "running",
    "privateIpAddress": "10.100.22.199"
  },
  "version": "3.7.0",
  "clusterConfiguration": {
    "url": "https://parallelcluster-eb3f5552b3b0a354-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.7.0/clusters/test-d64qriya2pwa9wk0/configs/cluster-config.yaml?versionId=QCnH2W2YCK1dsPOOgbkf5M7q3iznCS99&AWSAccessKeyId=ASIAQ4YVRSP2HSAQGCFF&Signature=6n3fsg%2FRMyhVOJyg56oP9CQTpCQ%3D&x-amz-security-token=FwoGZXIvYXdzEE0aDBdm1rbxsAsbIC7zQiL5AUUMDJZEbrXbbKnh21A3JXM55qtAKCQIFch3TVpqvBl6YrsJKW19UuT9vUiFTTBad9o665poEHMrJySGWE579K8NVkxvJknpIpWHHhRES6UiOEUo294D3e7k2KnizGK7O0Wb0kvETb%2FTuvpTAmz2nDUz8nrhEgpDifV3cd%2BXRSr2GxkksMUx4DFDEfR7RihF5%2BTX4hBc879xXTJOswp8mrOyWPSrBIoGPYniGyZQ2caaqV10uOupsT9RaB4fDPIzLl4do%2FbUPcMSQXUKVR7tWImpweghmAoF8KtveuxPkcOYGYbxl8z2w%2B%2F2fszXTAAFCueHHdiFYlPdOCjn54ivBjIrtcukK1CGpHHGApp99ozdVQIKmTTGpkXfFtygRk1Grs8q5uF%2BroZquC800g%3D%3D&Expires=1709333362"
  },
  "tags": [
    {
      "value": "3.7.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "test",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "true",
      "key": "parallelcluster-ui"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "test",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:061785347060:stack/test/b5afda70-b586-11ee-9649-127bd0c973a3",
  "lastUpdatedTime": "2024-01-17T22:21:07.571Z",
  "region": "us-east-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

pcluster describe-cluster -r us-east-1 --cluster-name <REDACTED_NAME_OF_PROBLEM_CLUSTER>

{
  "creationTime": "2023-10-07T00:38:18.206Z",
  "headNode": {
    "launchTime": "2023-10-07T00:42:54.000Z",
    "instanceId": "<REDACTED>",
    "instanceType": "c6in.xlarge",
    "state": "running",
    "privateIpAddress": "10.100.15.170"
  },
  "version": "3.7.0",
  "clusterConfiguration": {
    "url": "<REDACTED>"
  },
  "tags": [
    {
      "value": "3.7.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "<REDACTED>",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "true",
      "key": "parallelcluster-ui"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_COMPLETE",
  "clusterName": "<REDACTED>",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "<REDACTED>",
  "lastUpdatedTime": "2023-12-14T01:14:13.238Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

davprat commented 6 months ago

ParallelCluster calls the CloudFormation describe-stacks API through the boto3 package. If your account has many CloudFormation stacks, the response from that API can exceed 1 MB, at which point the API starts paging the results. This is why the list-clusters response includes a nextToken property. Because this API doesn't support server-side filtering, ParallelCluster must filter on the client side; if a given 1 MB page contains no cluster stacks, the cluster list returned for that page may be empty.
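To illustrate why a single page can come back without the cluster you expect, here is a minimal sketch of client-side filtering over paged describe-stacks results. It is not ParallelCluster's actual code: the helper names are made up for this example, and filtering on the parallelcluster:version tag (the tag key visible in the describe-cluster output above) is an assumption about how a cluster stack could be recognized.

```python
from typing import Iterable, Iterator

# Assumption: a cluster stack is recognized by this tag key, which appears
# in the describe-cluster output in this thread.
CLUSTER_TAG = "parallelcluster:version"


def is_cluster_stack(stack: dict) -> bool:
    """Return True if the stack carries the (assumed) ParallelCluster tag."""
    return any(tag.get("Key") == CLUSTER_TAG for tag in stack.get("Tags", []))


def cluster_names_per_page(pages: Iterable[dict]) -> Iterator[list]:
    """Yield, for each describe-stacks page, the names of the cluster stacks in it.

    A page made up entirely of non-cluster stacks yields an empty list even
    though later pages may still contain clusters.
    """
    for page in pages:
        yield [s["StackName"] for s in page.get("Stacks", []) if is_cluster_stack(s)]


def describe_stack_pages(region: str) -> Iterator[dict]:
    """Iterate over every describe-stacks page using the boto3 paginator."""
    import boto3  # imported lazily so the filtering logic runs without AWS access

    client = boto3.client("cloudformation", region_name=region)
    yield from client.get_paginator("describe_stacks").paginate()
```

Calling list(cluster_names_per_page(describe_stack_pages("us-east-1"))) would show one list per page, which makes it easy to see which page a given cluster stack lands on.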

The solution is to take the nextToken value from the response and pass it to a new request: pcluster list-clusters --next-token NEXT_TOKEN. Repeat this process until nextToken is null.
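The repeat-until-null loop can be scripted. The following is a rough sketch that shells out to the pcluster CLI and follows nextToken until it is absent. The function names are hypothetical, and it assumes that pcluster is on the PATH and that its stdout contains only the JSON document shown earlier in this thread.

```python
import json
import subprocess
from typing import Callable, Optional


def fetch_page_via_cli(region: str, next_token: Optional[str]) -> dict:
    """Run `pcluster list-clusters` once and return the parsed JSON page."""
    cmd = ["pcluster", "list-clusters", "--region", region]
    if next_token:
        cmd += ["--next-token", next_token]
    # Assumption: stdout holds only the JSON response.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def list_all_clusters(
    region: str,
    fetch_page: Callable[[str, Optional[str]], dict] = fetch_page_via_cli,
) -> list:
    """Collect clusters from every page, following nextToken until it is missing or null."""
    clusters = []
    next_token = None
    while True:
        page = fetch_page(region, next_token)
        clusters.extend(page.get("clusters", []))
        next_token = page.get("nextToken")
        if not next_token:
            return clusters
```

For example, list_all_clusters("us-east-1") would return the merged clusters list across all pages. The fetch_page parameter exists only so the loop can be exercised with a stub instead of a real AWS account.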

davprat commented 6 months ago

I am also told the issue with PCUI should be fixed in a future release.

mission-coliveros commented 6 months ago

Thanks, we had already identified this issue a few days ago and implemented a workaround.

However, it seems like the ParallelCluster codebase itself could handle the pagination of the CloudFormation response.

judysng commented 5 months ago

Hi, we have released a new version of PCUI that includes the fix for displaying all clusters. You can deploy the new version of PCUI, and it should display your clusters properly.