NVIDIA / NVFlare

NVIDIA Federated Learning Application Runtime Environment
https://nvidia.github.io/NVFlare/
Apache License 2.0

AWS cloud deployments: Adding multi-region capability and auto-detection of GPU instance types and AMI images #2650

Closed dirkpetersen closed 1 month ago

dirkpetersen commented 2 months ago

Description

Deploying cloud infrastructure across multiple AWS regions can be challenging because AMI image IDs are unique to each region. With these proposed changes, the user is prompted to select a region first, followed by the desired image name (default: ubuntu-*-22.04-amd64-pro-server). After the user confirms an amd64 or arm64 image name pattern, NVFlare looks up the matching AMI image ID, identifies a compatible GPU instance type based on resources.json, and presents both to the user as the new defaults. Obtaining the AMI ID dynamically also means the user gets the latest image for that name pattern, with the security updates included in that image already installed.
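
To illustrate the AMI lookup described above, here is a minimal Python sketch, not the code in this PR (which lives in the shell start script). It assumes the candidate images are already available as a list of dicts with the `ImageId`, `Name` and `CreationDate` fields that `aws ec2 describe-images` returns. The function name `latest_ami`, the choice to wrap the pattern in wildcards, and the sample image names and IDs are all hypothetical.

```python
from fnmatch import fnmatchcase

def latest_ami(images, name_pattern):
    """Return the ImageId of the newest image whose Name matches the glob, or None.

    Assumption: the pattern is wrapped in wildcards so that a prefix or a
    date suffix in the real image name does not prevent a match.
    """
    matches = [img for img in images if fnmatchcase(img["Name"], f"*{name_pattern}*")]
    if not matches:
        return None
    # CreationDate is an ISO 8601 UTC string, so string order is time order.
    newest = max(matches, key=lambda img: img["CreationDate"])
    return newest["ImageId"]

# Hypothetical sample data shaped like the "Images" list of describe-images.
SAMPLE_IMAGES = [
    {"ImageId": "ami-aaa", "Name": "ubuntu-jammy-22.04-amd64-pro-server-20240101",
     "CreationDate": "2024-01-01T00:00:00.000Z"},
    {"ImageId": "ami-bbb", "Name": "ubuntu-jammy-22.04-amd64-pro-server-20240501",
     "CreationDate": "2024-05-01T00:00:00.000Z"},
    {"ImageId": "ami-ccc", "Name": "ubuntu-jammy-22.04-arm64-pro-server-20240601",
     "CreationDate": "2024-06-01T00:00:00.000Z"},
]

print(latest_ami(SAMPLE_IMAGES, "ubuntu-*-22.04-amd64-pro-server"))
```

Because the image list is fetched per region, running the same selection against each region's list yields the region-specific ID without hard-coding any of them.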

Types of changes

I tested the following combinations over the last few days. Only the 24.04 ARM option seems to fail, with NVIDIA drivers 535 and 550: the instances freeze after the driver is installed. This is unrelated to NVFlare.

[image: tested combinations]

The new UI would look like this:

startup/start.sh --cloud aws
This script requires aws (AWS CLI), sshpass, dig and jq.  Now checking if they are installed.
Checking if aws exists. => found
Checking if sshpass exists. => found
Checking if dig exists. => found
Checking if jq exists. => found
Note: run this command first for a different AWS profile:
  export AWS_PROFILE=your-profile-name.

Checking AWS identity ...

* Cloud EC2 region, press ENTER to accept default: us-east-2
* Cloud AMI image name, press ENTER to accept default (use amd64 or arm64): ubuntu-*-22.04-amd64-pro-server
    retrieving AMI ID for ubuntu-*-22.04-amd64-pro-server ... ami-0a7e9bed072bb379b found
    finding smallest instance type with 1 GPUs and 15360 MiB VRAM ... g6.xlarge found
* Cloud EC2 type, press ENTER to accept default: g6.xlarge
* Cloud AMI image id, press ENTER to accept default: ami-0a7e9bed072bb379b
region = us-east-2, EC2 type = g6.xlarge, ami image = ami-0a7e9bed072bb379b , OK? (Y/n)
If the client requires additional Python packages, please add them to:
    /home/dp/NVFlare/dirk/Test/AWS-T4.X/startup/requirements.txt
Press ENTER when it's done or no additional dependencies.

Checking if default VPC exists
Default VPC found
Generating key pair for VM
Creating VM at region us-east-2, this may take a few minutes ...
VM created with IP address: 52.14.44.113
Copying files to nvflare_client
Destination folder is ubuntu@52.14.44.113:/var/tmp/cloud
Installing os packages as root in the background, this may take a few minutes ...
Installing user space packages in the background, this may take a few minutes ...
System was provisioned, packages may continue to install in the background.
To terminate the EC2 instance, run the following command:
  aws ec2 terminate-instances --region us-east-2 --instance-ids i-0837e105f2661a4e3
Other resources provisioned
security group: nvflare_client_sg_2036
key pair: NVFlareClientKeyPair
review install progress:
  tail -f /tmp/nvflare-aws-YGR.log
login to instance:
  ssh -i "/home/dirk/AWS-T4.X/NVFlareClientKeyPair_i-0837e105f2661a4e3.pem" ubuntu@52.14.44.113
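
The "finding smallest instance type with 1 GPUs and 15360 MiB VRAM" step in the session above could be sketched roughly as follows. This is an illustration in Python rather than the PR's shell code, and it rests on several assumptions: the instance types are given as dicts shaped like the `InstanceTypes` entries of `aws ec2 describe-instance-types`, the VRAM requirement is compared against total GPU memory, and "smallest" means fewest vCPUs, then least system memory. The function names and the `demo*` instance type names are hypothetical.

```python
def gpu_count(itype):
    """Total number of GPUs on an instance type (0 if it has no GpuInfo)."""
    return sum(g["Count"] for g in itype.get("GpuInfo", {}).get("Gpus", []))

def smallest_gpu_instance(instance_types, min_gpus, min_vram_mib):
    """Return the name of the smallest type meeting the GPU requirements, or None."""
    candidates = [
        t for t in instance_types
        if gpu_count(t) >= min_gpus
        and t.get("GpuInfo", {}).get("TotalGpuMemoryInMiB", 0) >= min_vram_mib
    ]
    if not candidates:
        return None
    # Assumption: rank by vCPU count first, then by system memory.
    best = min(candidates, key=lambda t: (t["VCpuInfo"]["DefaultVCpus"],
                                          t["MemoryInfo"]["SizeInMiB"]))
    return best["InstanceType"]

# Hypothetical sample data; names and sizes are made up for illustration.
SAMPLE_TYPES = [
    {"InstanceType": "demo.large", "VCpuInfo": {"DefaultVCpus": 2},
     "MemoryInfo": {"SizeInMiB": 8192}},
    {"InstanceType": "demo-gpu.4xlarge", "VCpuInfo": {"DefaultVCpus": 16},
     "MemoryInfo": {"SizeInMiB": 65536},
     "GpuInfo": {"Gpus": [{"Count": 1}], "TotalGpuMemoryInMiB": 16384}},
    {"InstanceType": "demo-gpu.xlarge", "VCpuInfo": {"DefaultVCpus": 4},
     "MemoryInfo": {"SizeInMiB": 16384},
     "GpuInfo": {"Gpus": [{"Count": 1}], "TotalGpuMemoryInMiB": 16384}},
    {"InstanceType": "demo-lowvram.xlarge", "VCpuInfo": {"DefaultVCpus": 4},
     "MemoryInfo": {"SizeInMiB": 16384},
     "GpuInfo": {"Gpus": [{"Count": 1}], "TotalGpuMemoryInMiB": 8192}},
]

print(smallest_gpu_instance(SAMPLE_TYPES, min_gpus=1, min_vram_mib=15360))
```

A different ranking (for example by hourly price) would change which type counts as "smallest", so the sort key is the part most likely to differ from the real implementation.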

Note that commit 2c035cd was updated via force-push to ccd3bc3 because of a wrong port (22 instead of 8002-8003).

YuanTingHsieh commented 2 months ago

@dirkpetersen could you try signing your commits? thanks

chesterxgchen commented 1 month ago

@IsaacYangSLA should we merge this PR if it's already approved?

IsaacYangSLA commented 1 month ago

/build

IsaacYangSLA commented 1 month ago

/build