amazonlinux / amazon-linux-2023

Amazon Linux 2023
https://aws.amazon.com/linux/amazon-linux-2023/

[Feature Request] - GPU AMIs, NVIDIA drivers, CUDA support: i.e. allowing use of GPUs on GPU instance types #12

Open stewartsmith opened 2 years ago

stewartsmith commented 2 years ago

GPU instance types tend to require additional software to take advantage of the GPUs. Much like the GPU AMIs available for AL2, we want them for AL2022.

bryantbiggs commented 1 year ago

Related issues:

bryantbiggs commented 9 months ago

Would this include NCCL? Ref https://github.com/amazonlinux/amazon-linux-2023/issues/491, which relies on NCCL.

stewartsmith commented 4 months ago

NVIDIA now has documentation on using their drivers on AL2023. See https://docs.nvidia.com/cuda/pdf/CUDA_Quick_Start_Guide.pdf and https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#amazon-linux-2023
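
For anyone landing here, the driver install from that guide boils down to roughly the following on AL2023 (a sketch based on the NVIDIA docs, not an official AWS recipe; package and module names come from NVIDIA's amzn2023 repo):

sudo dnf install -y dkms kernel-devel kernel-modules-extra
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
sudo dnf clean expire-cache
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda-toolkit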

rdelcampog commented 4 months ago

@stewartsmith Any ETA for this? Lots of projects are on hold waiting for this 😇

limmike commented 1 month ago

You can refer to "How do I install NVIDIA GPU driver, CUDA toolkit and optionally NVIDIA Container Toolkit on Amazon Linux 2023 (AL2023)?" for a user data script.
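
If you go the user-data route, the container-toolkit portion looks roughly like this (a sketch adapted from the script further down this thread rather than quoted from the article):

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y docker nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl enable --now docker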

jacob-talroo commented 1 month ago

FYI - I am following this ticket so that we don't have to do anything else to use the latest Linux with Nvidia drivers on AWS. We desire simplicity.

Our use case is bringing up these instances quickly, so any extra install steps add to our scale-up time. Of course, we could package our own AMI, but that is more complex than we'd like. An additional benefit of AWS bundling a recent version of Linux and NVIDIA drivers together is that AWS would support this configuration.

If this simplicity is not the desired goal of this ticket, let me know and I'll file another one.

limmike commented 1 month ago

Thanks for sharing your use case. I included the link in case others want to build their instances themselves while waiting.

ozbenh commented 1 month ago

Thanks for the feedback. We are looking at improving the integration of the NVIDIA driver packages. I cannot yet say whether we'll have a dedicated AMI or not (or if another AWS team will produce one), but we are investigating removing the need to source packages from elsewhere.

sammcj commented 4 weeks ago

Really keen to see some CUDA AL2023 AMIs made available; installing cuda-toolkit, nvidia-container-toolkit, and the nvidia-driver in a user data script is painfully slow.

bryantbiggs commented 4 weeks ago

If using containers, the CUDA toolkit isn't recommended on the host - CUDA will be provided within your application container as a user space dependency
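
If the host has just the driver and the container toolkit, a quick sanity check is something like this (the image tag is only an example):

# Should print the nvidia-smi table without any CUDA toolkit installed on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi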

sammcj commented 4 weeks ago

@bryantbiggs can you still pass through runtime: nvidia and deploy with driver: nvidia though?

If not, that breaks compatibility with non-AWS deployments, container images and compose files.

bryantbiggs commented 4 weeks ago

I don't follow - but I suspect yes.

Take PyTorch as an example - that is installed in your container (either via pip install or by using one of the PyTorch public images), and it provides the CUDA libraries that it requires, so the massive CUDA toolkit install on your host isn't used.
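
Roughly, with only the driver and the container toolkit on the host, a compose file along these lines gets GPU access via the device reservation syntax (service name and image tag here are just examples):

cat <<'EOF' > docker-compose.yaml
services:
  torch:
    # example tag; the image ships its own CUDA user-space libraries
    image: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime
    command: python -c "import torch; print(torch.cuda.is_available())"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
docker compose up

The legacy runtime: nvidia key should also keep working once nvidia-ctk runtime configure has registered the runtime in daemon.json, but the reservation syntax above is the more portable option.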

sammcj commented 4 weeks ago

Ohh, sorry I was thinking of the container toolkit 🤦, yes you're quite right!

FYI this is my current, pretty disgusting instance bootstrap script 😅:

#!/bin/bash
set -euxo pipefail

# Function for logging
log() {
  echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*" | tee -a /var/log/gpu-instance-setup.log
}

# Error handling
handle_error() {
  log "Error occurred in line $1"
  # exit 1
}

trap 'handle_error $LINENO' ERR

# Enable SSM agent
log "Enable SSM agent"
sudo systemctl enable amazon-ssm-agent --now

# Set environment variables
export TMPDIR=/home/ec2-user/tmp
export CC=/usr/bin/gcc10-cc

# Update and install packages
log "Updating system"
echo "fastestmirror=true" | sudo tee -a /etc/dnf/dnf.conf
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf update -y

# Install NVIDIA drivers
log "Installing NVIDIA drivers"
sudo dnf install -y dkms kernel-devel kernel-modules-extra
sudo dnf clean expire-cache
sudo dnf -y module install nvidia-driver:latest-dkms

# Install NVIDIA Container Toolkit
log "Installing Nvidia Container Toolkit packages"

# AWS doesn't have a working AL2023 AMI for nvidia (see https://github.com/amazonlinux/amazon-linux-2023/issues/12), so we need to install the toolkit manually
sudo dnf install -y docker nvidia-container-toolkit

log "Installing docker compose"
sudo mkdir -p /usr/local/lib/docker/cli-plugins
# shellcheck disable=SC2154
sudo curl -SL "https://github.com/docker/compose/releases/download/v${compose_version}/docker-compose-linux-x86_64" -o /usr/local/lib/docker/cli-plugins/docker-compose
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
sudo docker compose version

# Configure Docker
log "Configuring Docker"
sudo usermod -aG docker ec2-user
sudo systemctl enable docker.service
sudo nvidia-ctk runtime configure --runtime=docker
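# nvidia-ctk writes the "nvidia" runtime entry into /etc/docker/daemon.json;
# the sed below then adds concurrency settings into that same file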

sudo sed -i 's/"runtimes": {/"max-concurrent-downloads": 8,\n    "max-concurrent-uploads": 4,\n    "runtimes": {/' /etc/docker/daemon.json
sudo systemctl restart docker

# Function to wait for Docker to be ready
wait_for_docker() {
  local max_attempts=30
  local attempt=1
  while ! docker info >/dev/null 2>&1; do
    if [ $attempt -eq $max_attempts ]; then
      log "Docker did not become ready in time"
      return 1
    fi
    log "Waiting for Docker to be ready (attempt $attempt/$max_attempts)..."
    sleep 5
    ((attempt++))
  done
  return 0
}

log "Installing additional tooling packages"
sudo dnf install -y wget htop tmux git jq iftop python3-pip
sudo pip install -U nvitop &

# Add an alias for the ollama command (user data runs as root, so cover both root and ec2-user)
echo "alias ollama='docker exec -it ollama ollama'" >>~/.bashrc
echo "alias ollama='docker exec -it ollama ollama'" | sudo tee -a /home/ec2-user/.bashrc >/dev/null

# Wait for Docker to be ready
wait_for_docker

### Docker Compose via SSM ###

# Fetch docker-compose.yaml from SSM Parameter Store
# shellcheck disable=SC2154
sudo aws ssm get-parameter --name "/ai-accelerator/docker-compose" --with-decryption --region "${aws_region}" --query "Parameter.Value" --output text | sudo tee /home/ec2-user/docker-compose.yaml >/dev/null

# Set correct ownership and permissions
sudo chown ec2-user:ec2-user /home/ec2-user/docker-compose.yaml
sudo chmod 644 /home/ec2-user/docker-compose.yaml

# Navigate to the directory containing docker-compose.yaml and start the services
sudo sh -c 'cd /home/ec2-user && docker compose up -d'

log "GPU instance setup completed successfully"

For context - for "normal" workloads I'd just deploy things to ECS/EKS, but there are a few cases where I need to spin up AI/LLM tooling on a big ol' dumb GPU-optimised EC2 instance for PoCs/testing (also as Bedrock is basically useless in the AU regions).