aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
134 stars 57 forks

PlacementGroup option for "Capacity Blocks for ML" #353

Open liyier90 opened 3 weeks ago

liyier90 commented 3 weeks ago

In the pcluster config (https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/2.aws-parallelcluster/distributed-training-p4de-base.yaml#L42-L43), there is a comment saying we should set PlacementGroup to Enabled: false when using a targeted ODCR.

May I know if this also applies to a targeted "Capacity Blocks for ML" reservation?

sean-smith commented 3 weeks ago

Placement groups can conflict with the ODCR, leading to an insufficient capacity error (ICE).

Targeted ODCRs are already mapped to a single spine in an AZ, so a placement group is redundant and can cause ICEs.
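
For reference, the setting under discussion might look roughly like the fragment below. This is a minimal sketch of a ParallelCluster 3 queue definition, not the contents of the linked file: the queue name, compute resource name, instance counts, subnet ID, and reservation ID are all placeholders chosen for illustration. It only shows the targeted ODCR case described above and does not say whether the same applies to Capacity Blocks.

```yaml
# Sketch only: partial ParallelCluster 3 config for a targeted ODCR.
# All names and IDs below are hypothetical placeholders.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute-gpu                      # placeholder queue name
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0         # placeholder subnet
        PlacementGroup:
          Enabled: false                     # avoid conflicting with the targeted ODCR
      ComputeResources:
        - Name: distributed-ml               # placeholder resource name
          InstanceType: p4de.24xlarge
          MinCount: 2                        # placeholder counts
          MaxCount: 2
          CapacityReservationTarget:
            CapacityReservationId: cr-0123456789abcdef0   # placeholder reservation
```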

liyier90 commented 3 weeks ago

Thanks for the explanation, @sean-smith. May I know if there is a distinction between an ODCR and a Capacity Block? I ask because Capacity Block is listed as a different reservation type on the console.
