aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution

Reducing number of compute resources too aggressively #220

Closed gwolski closed 4 months ago

gwolski commented 5 months ago

I'm building a cluster with just nine instance types, and certain instance types are being culled to "reduce number of CRs". This is unnecessary, as I do not have many compute resources.

Config file has:

    InstanceConfig:
      UseSpot: false
      NodeCounts:
        # @TODO: Update the max number of each instance type to configure
        DefaultMaxCount: 10
      Include:
        InstanceTypes:
          - m7a.large
          - m7a.xlarge
          - m7a.2xlarge
          - m7a.4xlarge
          - r7a.large
          - r7a.xlarge
          - r7a.2xlarge
          - r7a.4xlarge
          - r7a.8xlarge
It then buckets appropriately:

    INFO: Instance type by memory and core:
    INFO: 6 unique memory size:
    INFO:     8 GB
    INFO:         1 instance type with 2 core(s): ['m7a.large']
    INFO:     16 GB
    INFO:         1 instance type with 2 core(s): ['r7a.large']
    INFO:         1 instance type with 4 core(s): ['m7a.xlarge']
    INFO:     32 GB
    INFO:         1 instance type with 4 core(s): ['r7a.xlarge']
    INFO:         1 instance type with 8 core(s): ['m7a.2xlarge']
    INFO:     64 GB
    INFO:         1 instance type with 8 core(s): ['r7a.2xlarge']
    INFO:         1 instance type with 16 core(s): ['m7a.4xlarge']
    INFO:     128 GB
    INFO:         1 instance type with 16 core(s): ['r7a.4xlarge']
    INFO:     256 GB
    INFO:         1 instance type with 32 core(s): ['r7a.8xlarge']
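For reference, the bucketing step is just a grouping of instance types by memory size and then by core count. Here is a minimal sketch of that idea (the specs are hardcoded to match the log above, and none of these names are the repo's actual code; the real tool discovers instance specs from AWS):

    from collections import defaultdict

    # (memory GiB, cores) per instance type; values match the log output above.
    INSTANCE_SPECS = {
        "m7a.large": (8, 2),     "m7a.xlarge": (16, 4),    "m7a.2xlarge": (32, 8),
        "m7a.4xlarge": (64, 16), "r7a.large": (16, 2),     "r7a.xlarge": (32, 4),
        "r7a.2xlarge": (64, 8),  "r7a.4xlarge": (128, 16), "r7a.8xlarge": (256, 32),
    }

    def bucket_by_memory_and_cores(instance_types):
        """Group instance types by memory size, then by core count."""
        buckets = defaultdict(lambda: defaultdict(list))
        for instance_type in instance_types:
            mem_gb, cores = INSTANCE_SPECS[instance_type]
            buckets[mem_gb][cores].append(instance_type)
        return buckets

    for mem_gb, by_cores in sorted(bucket_by_memory_and_cores(INSTANCE_SPECS).items()):
        print(f"{mem_gb} GB")
        for cores, types in sorted(by_cores.items()):
            print(f"    {len(types)} instance type with {cores} core(s): {types}")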

But then it starts culling unnecessarily, since ParallelCluster/Slurm can handle 9 compute resources...

    INFO: Configuring od-8-gb queue:
    INFO:     Adding od-8gb-2-cores compute resource: ['m7a.large']
    INFO: Configuring od-16-gb queue:
    INFO:     Adding od-16gb-2-cores compute resource: ['r7a.large']
    INFO:     Skipping od-16gb-4-cores compute resource: ['m7a.xlarge'] to reduce number of CRs.
    INFO: Configuring od-32-gb queue:
    INFO:     Adding od-32gb-4-cores compute resource: ['r7a.xlarge']
    INFO:     Skipping od-32gb-8-cores compute resource: ['m7a.2xlarge'] to reduce number of CRs.
    INFO: Configuring od-64-gb queue:
    INFO:     Adding od-64gb-8-cores compute resource: ['r7a.2xlarge']
    INFO:     Skipping od-64gb-16-cores compute resource: ['m7a.4xlarge'] to reduce number of CRs.
    INFO: Configuring od-128-gb queue:
    INFO:     Adding od-128gb-16-cores compute resource: ['r7a.4xlarge']
    INFO: Configuring od-256-gb queue:
    INFO:     Adding od-256gb-32-cores compute resource: ['r7a.8xlarge']
    INFO: Created 6 queues with 6 compute resources

I would like to have a 16-core 64 GB machine, an 8-core 32 GB machine, etc. How do I disable or modify this "culling"? I would argue we should only start culling when we exceed what ParallelCluster can handle.

We can now have 50 Slurm queues per cluster, 50 compute resources per queue, and 50 compute resources per cluster! See: https://docs.aws.amazon.com/parallelcluster/latest/ug/configuration-of-multiple-queues-v3.html
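To make those limits concrete, here is a quick sanity check one could run against a planned queue layout. The constants reflect the documented v3 limits; everything else here is hypothetical, for illustration only:

    # Documented ParallelCluster v3 limits (see link above).
    MAX_QUEUES_PER_CLUSTER = 50
    MAX_CRS_PER_QUEUE = 50
    MAX_CRS_PER_CLUSTER = 50

    def check_limits(queues):
        """queues: mapping of queue name -> list of compute resource names."""
        if len(queues) > MAX_QUEUES_PER_CLUSTER:
            raise ValueError(f"{len(queues)} queues > {MAX_QUEUES_PER_CLUSTER}")
        total_crs = 0
        for name, crs in queues.items():
            if len(crs) > MAX_CRS_PER_QUEUE:
                raise ValueError(f"queue {name}: {len(crs)} CRs > {MAX_CRS_PER_QUEUE}")
            total_crs += len(crs)
        if total_crs > MAX_CRS_PER_CLUSTER:
            raise ValueError(f"{total_crs} CRs in cluster > {MAX_CRS_PER_CLUSTER}")

    # Nine instance types, even as nine separate CRs, are well under every limit.
    check_limits({"od-64-gb": ["od-64gb-8-cores", "od-64gb-16-cores"]})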

gwolski commented 5 months ago

I've found that by commenting out the following three lines in source/cdk/cdk_slurm_stack.py I could turn off the reduction code (at line 2770 in the code I have):

                    if len(parallel_cluster_queue['ComputeResources']):
                        logger.info(f"    Skipping {compute_resource_name:18} compute resource: {instance_types} to reduce number of CRs.")
                        continue

The next line checks whether I've exceeded MAX_NUMBER_OF_COMPUTE_RESOURCES, so there is still a safety check in case my configuration were too large.
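For context, here is how I read the surrounding logic. The names compute_resource_name, instance_types, and MAX_NUMBER_OF_COMPUTE_RESOURCES come from the snippet and prose above; the rest of the control flow is reconstructed for illustration and may not match the actual file:

    MAX_NUMBER_OF_COMPUTE_RESOURCES = 50  # assumed value; only the constant name is from the repo

    def add_compute_resources(parallel_cluster_queue, candidate_crs, number_of_compute_resources):
        """candidate_crs: list of (compute_resource_name, instance_types) tuples for one queue."""
        for compute_resource_name, instance_types in candidate_crs:
            # The three lines commented out above: once the queue already has one CR,
            # every further (larger-core) CR for the same memory size gets skipped.
            # if len(parallel_cluster_queue['ComputeResources']):
            #     continue

            # The safety check that remains: fail before exceeding the cluster-wide limit.
            if number_of_compute_resources >= MAX_NUMBER_OF_COMPUTE_RESOURCES:
                raise RuntimeError("Too many compute resources for ParallelCluster")
            parallel_cluster_queue['ComputeResources'].append(
                {'Name': compute_resource_name, 'InstanceTypes': instance_types})
            number_of_compute_resources += 1
        return number_of_compute_resources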

I want to be able to have machines with the same core count but less memory - no need to pay for more than I need.

cartalla commented 4 months ago

I was trying to configure as many instance types as allowed by ParallelCluster's limits, but in retrospect this should really be left up to the user to configure.

I've changed the code to create just 1 instance type per CR and 1 CR per queue/partition. This should allow you to pick and choose which instances you use. It is now an error if you configure too many instance types; you must either remove included instances or exclude instances until you get under the ParallelCluster limit of 50.
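A minimal sketch of the new behavior, under my reading of the comment above (the names and queue dict layout are illustrative, not the repo's actual code):

    PARALLEL_CLUSTER_CR_LIMIT = 50

    def build_queues(instance_types):
        """One queue per instance type, one CR per queue; error if over the limit."""
        if len(instance_types) > PARALLEL_CLUSTER_CR_LIMIT:
            raise ValueError(
                f"{len(instance_types)} instance types configured; remove included or "
                f"add excluded instances until there are at most {PARALLEL_CLUSTER_CR_LIMIT}")
        queues = []
        for instance_type in instance_types:
            safe_name = instance_type.replace('.', '-')
            queues.append({
                'Name': f"od-{safe_name}",
                'ComputeResources': [{'Name': safe_name, 'InstanceTypes': [instance_type]}],
            })
        return queues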