Cluster Stopped Appearing in Lists for CLI/UI

mission-coliveros commented 6 months ago

Bug description and how to reproduce: We've had a production cluster up and running for about 4 months. It is online and able to scale up nodes and run jobs, but it has stopped appearing in list commands run via either the ParallelCluster UI or the CLI. Strangely enough, the describe-cluster command still works.

Additional context:

We created the cluster via this repo
The UI and clusters are both on version 3.7.0
We've deleted/rebuilt the UI stack with no change in outcome
For the first few months, this cluster appeared as expected in both the CLI and the UI, and we were able to update it. We're not sure whether we can still update it, but I'm guessing we can't, because the Lambda that deploys or updates the cluster runs the list-clusters command. I think it would try to create a new cluster under the same name.
We have a case open with premium support, but they have not been helpful so far
We created a new test cluster, which appears in both the UI and the CLI. The tags on each ParallelCluster CloudFormation stack are consistent with one another; no tags appear to have been removed or modified on the problem cluster.

The commands we're running are:

pcluster list-clusters -r us-east-1

{
  "nextToken": "CdHr0bQ/U0fy83/BwVnyL9LqOEUrV4yKxUVyQ3+vedhjbjKGcjgb/0Fri+r8MmVE0dhKWa1qFduPGiQTX2KDhN0BWMVXNb9DGBBy92OjS/MzGVj54NrtmMfhQGt5RbfOPXKBdvbqW0to7f+osmQ8rlil+OZmhGLNhbDJ48pwOEM2NB06/rpQldOvGFE8psEMPQgyAmBTVUeFwhessL5oOOVxnrUXB9dPYMirCAG2/flYAyyxB85mtCbntRyihm3iah/SDgtwa7WUgy6KQHGV7wt3K29M09SQ+rWDL/XMOmjkLeSAdyzMi5PE5jHqN/nHaNzfAYGjnRBcIhvuX2d7O/gC3bHWgfhYB4aVwSn0Wf5hhl6cifbtd/1qCA6AM4rIageNFpiXfVJnSzXpUPr5GcOo+qYEbgvhehQfvVOJXD1Zoryvsb+9ytE8rCfNxkr5l1+dKi8KfJ5DDAbBuR3oV96mNGHvlot7CcK0hH/SLL3T/m6ZGfYtzPmr8Aq1FqhOTlNXMIc4hEG3Awww+OBVWjM957reFcdT4Vp0sfUQgmZ8HdK04mJRElGIrovd1iJVQkQtfWsp9VRNMyIpihgtLWCgyVIehuKFahpHfYLE43ue19nwdFNeD+WGKilZtWaTY8q5woGsvMhZF7JnGAwpl1WPNmesO8Hz5ZSFOK4VkJGt6S+kxjKo3LbVfjy+C1WQp1Glg0v2j8PCSC983poUicdNu+C9vrN/gYGGjQ+ra6E=|jjlYHUJmavcc50EvL9gG/w==|1|3616c0329540075297171e15d3a1024b42d63fb8ead272a5efe4914d5e42dafe",
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "CREATE_COMPLETE",
      "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:061785347060:stack/test/b5afda70-b586-11ee-9649-127bd0c973a3",
      "region": "us-east-1",
      "version": "3.7.0",
      "clusterStatus": "CREATE_COMPLETE",
      "scheduler": {
        "type": "slurm"
      }
    }
  ]
}

pcluster describe-cluster -r us-east-1 --cluster-name test

{
  "creationTime": "2024-01-17T22:21:07.571Z",
  "headNode": {
    "launchTime": "2024-01-17T22:26:07.000Z",
    "instanceId": "i-0d726a63fda0fd34b",
    "instanceType": "t2.micro",
    "state": "running",
    "privateIpAddress": "10.100.22.199"
  },
  "version": "3.7.0",
  "clusterConfiguration": {
    "url": "https://parallelcluster-eb3f5552b3b0a354-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.7.0/clusters/test-d64qriya2pwa9wk0/configs/cluster-config.yaml?versionId=QCnH2W2YCK1dsPOOgbkf5M7q3iznCS99&AWSAccessKeyId=ASIAQ4YVRSP2HSAQGCFF&Signature=6n3fsg%2FRMyhVOJyg56oP9CQTpCQ%3D&x-amz-security-token=FwoGZXIvYXdzEE0aDBdm1rbxsAsbIC7zQiL5AUUMDJZEbrXbbKnh21A3JXM55qtAKCQIFch3TVpqvBl6YrsJKW19UuT9vUiFTTBad9o665poEHMrJySGWE579K8NVkxvJknpIpWHHhRES6UiOEUo294D3e7k2KnizGK7O0Wb0kvETb%2FTuvpTAmz2nDUz8nrhEgpDifV3cd%2BXRSr2GxkksMUx4DFDEfR7RihF5%2BTX4hBc879xXTJOswp8mrOyWPSrBIoGPYniGyZQ2caaqV10uOupsT9RaB4fDPIzLl4do%2FbUPcMSQXUKVR7tWImpweghmAoF8KtveuxPkcOYGYbxl8z2w%2B%2F2fszXTAAFCueHHdiFYlPdOCjn54ivBjIrtcukK1CGpHHGApp99ozdVQIKmTTGpkXfFtygRk1Grs8q5uF%2BroZquC800g%3D%3D&Expires=1709333362"
  },
  "tags": [
    {
      "value": "3.7.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "test",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "true",
      "key": "parallelcluster-ui"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "test",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:061785347060:stack/test/b5afda70-b586-11ee-9649-127bd0c973a3",
  "lastUpdatedTime": "2024-01-17T22:21:07.571Z",
  "region": "us-east-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

pcluster describe-cluster -r us-east-1 --cluster-name <REDACTED_NAME_OF_PROBLEM_CLUSTER>

{
  "creationTime": "2023-10-07T00:38:18.206Z",
  "headNode": {
    "launchTime": "2023-10-07T00:42:54.000Z",
    "instanceId": "<REDACTED>",
    "instanceType": "c6in.xlarge",
    "state": "running",
    "privateIpAddress": "10.100.15.170"
  },
  "version": "3.7.0",
  "clusterConfiguration": {
    "url": "<REDACTED>"
  },
  "tags": [
    {
      "value": "3.7.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "<REDACTED>",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "true",
      "key": "parallelcluster-ui"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_COMPLETE",
  "clusterName": "<REDACTED>",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "<REDACTED>",
  "lastUpdatedTime": "2023-12-14T01:14:13.238Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

davprat commented 6 months ago

ParallelCluster calls the CloudFormation describe-stacks API through the boto3 package. If your account has many CloudFormation stacks, the response from that API can exceed 1 MB, at which point the API starts paging the results. That is why the list-clusters response includes a nextToken property. Because this API doesn't support server-side filtering, ParallelCluster must filter on the client side. If a given 1 MB page contains no cluster stacks, the response for that page may be empty.
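
To illustrate why a page can come back empty, here is a minimal simulation of per-page client-side filtering. It is not ParallelCluster's actual code: the stack names and page contents are made up, and using the parallelcluster:cluster-name tag (taken from the describe-cluster output above) as the filter criterion is an assumption.

```python
from typing import Dict, List

# Assumption: stacks are recognized as clusters by this tag key, which appears
# in the describe-cluster output earlier in the thread.
CLUSTER_TAG = "parallelcluster:cluster-name"


def filter_cluster_stacks(page: List[Dict]) -> List[Dict]:
    """Keep only the stacks on one page that carry the cluster tag."""
    return [
        stack
        for stack in page
        if any(tag["Key"] == CLUSTER_TAG for tag in stack.get("Tags", []))
    ]


# Hypothetical describe-stacks pages; real pages are split by response size.
pages = [
    [
        {"StackName": "vpc", "Tags": []},
        {"StackName": "test", "Tags": [{"Key": CLUSTER_TAG, "Value": "test"}]},
    ],
    [
        {"StackName": "app-1", "Tags": []},
        {"StackName": "app-2", "Tags": []},
    ],
    [
        {"StackName": "prod", "Tags": [{"Key": CLUSTER_TAG, "Value": "prod"}]},
    ],
]

# Filtering happens one page at a time, so the middle page yields no clusters
# even though another cluster exists on a later page.
clusters_per_page = [
    [stack["StackName"] for stack in filter_cluster_stacks(page)] for page in pages
]
print(clusters_per_page)
```

A caller that reads only the first page would see "test" and never reach "prod", which matches the behavior reported above.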

The solution is to take the nextToken value from the response and pass it to a new request: pcluster list-clusters --next-token NEXT_TOKEN. Repeat this process until nextToken is null.
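
The repeat-until-null loop could be scripted roughly as follows. This is a sketch rather than an official tool: it assumes the pcluster CLI is on the PATH and prints JSON shaped like the output earlier in this thread, and the helper names fetch_page_via_cli and list_all_clusters are hypothetical.

```python
import json
import subprocess
from typing import Callable, Dict, List, Optional


def fetch_page_via_cli(region: str, next_token: Optional[str] = None) -> Dict:
    """Run `pcluster list-clusters` once and return the parsed JSON page."""
    cmd = ["pcluster", "list-clusters", "-r", region]
    if next_token:
        cmd += ["--next-token", next_token]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def list_all_clusters(
    region: str,
    fetch_page: Callable[[str, Optional[str]], Dict] = fetch_page_via_cli,
) -> List[Dict]:
    """Follow nextToken until it is absent or null, collecting every cluster."""
    clusters: List[Dict] = []
    token: Optional[str] = None
    while True:
        page = fetch_page(region, token)
        # A page may legitimately contain no clusters; keep going regardless.
        clusters.extend(page.get("clusters", []))
        token = page.get("nextToken")
        if not token:
            return clusters


# Example usage (requires the pcluster CLI and AWS credentials):
# for cluster in list_all_clusters("us-east-1"):
#     print(cluster["clusterName"], cluster["clusterStatus"])
```

Passing the page fetcher in as a parameter keeps the loop independent of the CLI, so it can be exercised with canned pages.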

davprat commented 6 months ago

I am also told the issue with PCUI should be fixed in a future release.

mission-coliveros commented 6 months ago

Thanks, we had already identified this issue a few days ago and implemented a workaround.

However, it seems like the pagination of the CloudFormation response could be handled within the ParallelCluster codebase itself.

judysng commented 5 months ago

Hi, we have released a new version of PCUI that includes the fix for displaying all clusters. You can deploy the new version of PCUI, and it should display your clusters properly.

aws / aws-parallelcluster

Cluster Stopped Appearing in Lists for CLI/UI #6141