aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution
23 stars 7 forks source link

[BUG] Deploy Lambda Not Accounting for `list-clusters` pagination #213

Closed mission-coliveros closed 3 months ago

mission-coliveros commented 4 months ago

Describe the bug Behind the scenes of the ParallelCluster API, if you call the list-clusters command, it's basically just running aws cloudformation list-stacks. If you have too many CloudFormation stacks in your account, there isn't any logic to handle a paginated response (or a query argument to only get stacks with the tag of parallelcluster:cluster-name), so you have to pass in the CF nextToken yourself if you want to see all your clusters. As a consequence of this, when updating an existing cluster with this repo, the Lambda function might not see the target cluster, since it's not handling pagination (which to be fair, it shouldn't have to do in the first place)

Additional context Here's an example of how you might revise the get_cluster_status function in the CreateCluster Lambda (I performed some testing on my end and this seems to work):

def get_clusters(cluster_region):
    clusters = []
    kwargs_dict = {"region": cluster_region}

    while kwargs_dict:

        clusters_dict = pc.list_clusters(**kwargs_dict)

        if "nextToken" in clusters_dict:
            kwargs_dict["next_token"] = clusters_dict["nextToken"]
        else:
            kwargs_dict = None

        for cluster in clusters_dict['clusters']:
            clusters.append(cluster)

    return clusters

def get_cluster_status(cluster_name, cluster_region):
    logger.info("Listing clusters to get cluster status")
    cluster_status = None
    for cluster_dict in get_clusters(cluster_region):
        if cluster_dict['clusterName'] != cluster_name:
            continue
        logger.info(f"cluster_dict:\n{json.dumps(cluster_dict, indent=4)}")
        cluster_status = cluster_dict['clusterStatus']
        cluster_cloudformation_status = cluster_dict['cloudformationStackStatus']
        logger.info(
            f"{cluster_name} exists. Status={cluster_status} and cloudformation status={cluster_cloudformation_status}")
        break

    return cluster_status
cartalla commented 4 months ago

Thanks for reporting the problem. I will work on pushing a fix soon.

cartalla commented 3 months ago

I have merged this change with PR #203