aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
203 stars 86 forks

Adding the CF template for FSx Lustre for HyperPod EKS workshop #459

Closed shimomut closed 1 month ago

shimomut commented 1 month ago

Description of changes:

This PR starts maintaining the CloudFormation template for FSx Lustre in this repository. The template is used in the FSxL static provisioning section of the HyperPod EKS workshop (https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/01-cluster/06-fsx-for-lustre#static-provisioning-(optional)).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mhuguesaws commented 1 month ago

What's the value compared to https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/3.FSxLustre.yaml?

shimomut commented 1 month ago

I didn't pay enough attention to the existing template for HyperPod Slurm. Checking it now.

shimomut commented 1 month ago

The existing FSxL template for HyperPod Slurm depends on "2.SageMakerVPC.yaml" and refers to the Subnet and Security Group created by that stack:

      SecurityGroupIds:
        - Fn::ImportValue:
            !Sub "${NetworkStack}-SecurityGroup"
      SubnetIds:
        - Fn::ImportValue:
            !Sub "${NetworkStack}-PrivateSubnet"

We could change the existing template to support both HyperPod Slurm and HyperPod EKS by no longer importing these values and letting customers specify them manually. Does that seem like a good plan? @mhuguesaws
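
For illustration only, here is a minimal sketch of how the proposal might look, with the two `Fn::ImportValue` lookups replaced by template parameters. The parameter names (`SubnetId`, `SecurityGroupId`), the resource name, and the storage capacity are assumptions made for this example and are not taken from the existing 3.FSxLustre.yaml.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of an FSx for Lustre template that takes network IDs as parameters

Parameters:
  # Hypothetical parameter names; the actual template may choose different ones.
  SubnetId:
    Type: AWS::EC2::Subnet::Id
    Description: Subnet in which to create the file system (from the Slurm or EKS VPC)
  SecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group to attach to the file system

Resources:
  # Hypothetical resource name used only in this sketch.
  FSxLFilesystem:
    Type: AWS::FSx::FileSystem
    Properties:
      FileSystemType: LUSTRE
      StorageCapacity: 1200   # placeholder value in GiB, assumed for the example
      SecurityGroupIds:
        - !Ref SecurityGroupId   # supplied by the customer instead of Fn::ImportValue
      SubnetIds:
        - !Ref SubnetId          # supplied by the customer instead of Fn::ImportValue
```

With this shape the template no longer requires a `NetworkStack` export, so the same file could in principle be deployed against a VPC created for either HyperPod Slurm or HyperPod EKS.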

mhuguesaws commented 1 month ago

Yes, let's minimize duplication.

shimomut commented 1 month ago

Got it. I am withdrawing this PR and will work on a separate one that reuses the existing template.