JohnTigue opened 1 year ago
Decided to stay close to AWS, like the Russians stayed close to the Nazis at Stalingrad so the Luftwaffe couldn't drop bombs on them without hitting their own. So using Amazon Linux 2 as the base. And it just seems wise to let AWS work out the driver installation stuff. Maybe they do something extra right so things run faster. Dunno. But here's AWS's AMI to start with: Amazon Linux 2 AMI with NVIDIA TESLA GPU Driver.
AKA: amzn2-ami-graphics-hvm-2.0.20230119.1-x86_64-gp2-e6724620-3ffb-4cc9-9690-c310d8e794ef, ami-053fbfc5ab768846d.
The latest version (2.0.20230119.1) sounds like it's about a week old, so they keep it fresh.
Here are the AWS docs that click through to the above AMI: Install NVIDIA drivers on Linux instances.
That AMI doesn't seem to work with a g5? Failed twice.
So, let's just try vanilla AL2: Amazon Linux 2 Kernel 5.10 AMI 2.0.20230119.1 x86_64 HVM gp2, on a g5.xlarge. That is what TakeSix is.
Sounds like we ourselves could do some impressive tuning of SD 2.1 and PyTorch for a one-third to one-half speed-up: Accelerated Stable Diffusion with PyTorch 2.
The main goal is to have Docker images that get pushed to ECR and then downloaded later by ECS. Additionally we'd like to have SageMaker pull from there. And for dev/test on AWS it'd be nice to instantiate those on raw EC2 instances.
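For the ECR leg of that, it's the standard login/tag/push flow; everything below (account ID, region, repo name, image tag) is a placeholder:

AWS_ACCOUNT=123456789012   # placeholder
REGION=us-west-2           # placeholder
ECR=$AWS_ACCOUNT.dkr.ecr.$REGION.amazonaws.com

# Authenticate Docker to ECR, then tag and push the locally built image
aws ecr get-login-password --region $REGION \
  | docker login --username AWS --password-stdin $ECR
docker tag sd-webui:latest $ECR/sd-webui:latest
docker push $ECR/sd-webui:latest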
In the case of EC2 there is rigmarole that needs to be done to get the base machine running. E.g. Docker Compose v2 needs to be installed just to get stable-diffusion-webui-docker spun up. We can have a setup.sh which CloudFormation can run at instantiation, and which can also be run from the command line. Looks like such a thing could end up being about 20 lines:
https://github.com/marshmellow77/stable-diffusion-webui/blob/master/setup.sh
Nice work, but that one builds FROM Ubuntu, not Amazon Linux. An AL2 version is sketched below.
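A rough guess at the AL2 equivalent, right around the predicted 20 lines (package names and the Compose version pin are my assumptions, untested):

#!/usr/bin/env bash
# setup.sh sketch: bootstrap a fresh Amazon Linux 2 box for stable-diffusion-webui-docker.
set -euo pipefail

# Docker engine and git from the AL2 repos
sudo amazon-linux-extras install -y docker
sudo yum install -y git
sudo systemctl enable --now docker
sudo usermod -aG docker ec2-user

# Docker Compose v2 as a CLI plugin (version pin illustrative)
sudo mkdir -p /usr/local/lib/docker/cli-plugins
sudo curl -SL https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-linux-x86_64 \
  -o /usr/local/lib/docker/cli-plugins/docker-compose
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-compose

# Grab the repo itself
git clone https://github.com/AbdBarho/stable-diffusion-webui-docker.git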
uname -m
will tell us what architecture the machine thinks it is. Good for figuring out which drivers to install.

An argument for keeping the (large) model files in appdata, not in the core image (keep those small): GUIDE for INVOKEAI: A STABLE DIFFUSION TOOLKIT - DOCKER. Includes a Dockerfile and a start.sh.
How to install an Nvidia GPU driver on an AWS EC2 instance:
If you’re wondering why the GPU is not working out of the box for your AWS EC2 p2 instance using a Deep Learning AMI, it is because these Deep Learning AMIs are only supported on specific instance types such as G3, P3, P3dn, P4d, G5, G4dn.
Sure enough, AWS's AMI with Tesla drivers preinstalled seemingly cannot be used for a g5.
Let's try the GRID drivers instead. That seems to allow g5 instances: NVIDIA RTX Virtual Workstation - Amazon Linux 2.
And we're currently capped at 4 g5 GPUs. Just requested a quota limit increase. So, closing down some experiments in the meantime so I can spin up TakeSeven.
Alright, g5 spins up with GRID drivers installed. Test probe for GPU works:
[root@ip-172-31-8-49 ~]# nvidia-smi
Mon Feb 6 04:36:20 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 18C P8 15W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
So, I'll take it that means g5s want GRID drivers, not Tesla. Sounds like GRID is for data centers, so that makes sense.
InvokeAI: Installing: Running the container on your GPU:
If you have an Nvidia GPU, you can enable InvokeAI to run on the GPU by running the container with an extra environment variable to enable GPU usage and have the process run much faster:
GPU_FLAGS=all ./docker/run.sh
This passes --gpus all to docker and uses the GPU.
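In other words it's just Docker's standard GPU passthrough, so the same thing can be smoke-tested directly (image tag illustrative):

docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi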
InvokeAI Stable Diffusion Toolkit Docs: Installing with the Automated Installer:
Check that your system has an up-to-date Python installed. To do this, open up a command-line window ("Terminal" on Linux and Macintosh, "Command" or "Powershell" on Windows) and type python --version. If Python is installed, it will print out the version number. If it is version 3.9.1 or 3.10.x, you meet requirements. At this time we do not recommend Python 3.11
And they've got version 3.7.16 installed. Lovely.
How To Install Python 3.10 on Amazon Linux 2:
The reason why we are installing Python 3.10 on Amazon Linux 2 by building it from source is that the available versions in the default repositories are not up to date.
Yup, this Amazon Linux is really tuned up. Much better than simply using Ubuntu like the majority of the PyTorch and SD community do. Yup. Great choice.
Actually, maybe the Deep Learning AMI is the right way to go, simply to get a fresh Python pre-installed. It supports g5:
That still gets me back to having a stale Docker install (the Docker Compose v2 hassle), but that's a lot less hassle than setting up a machine for dev (compiling Python 3.10 from source).
So, let's try out a TakeEight using the Deep Learning AMI, which is seemingly AWS doing all this hassle build work for machine learning. Might have some of that magic tuning pixie dust in it.
Deep Learning AMI GPU PyTorch 1.13.1 (Amazon Linux 2) 20230201
- ami-07b46515572e215a2 (64-bit (x86))
- Virtualization: hvm
- ENA enabled: true
- Root device type: ebs
And that STILL only has Python 3.7. That's lame.
[root@ip-172-31-7-117 ~]# python3
Python 3.7.16 (default, Dec 15 2022, 23:24:54)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
I guess I'm back to building Python from scratch because Invoke wants 3.9 or better. Sounds like I should go for 3.10.9 because Installation says:
We recommend Version 3.10.9, which has been extensively tested with InvokeAI.
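On AL2 the from-source build goes roughly like this (note Python 3.10 needs OpenSSL 1.1.1+, hence AL2's openssl11 packages; the dependency list is my best guess, untested):

sudo yum -y groupinstall "Development Tools"
sudo yum -y install openssl11-devel bzip2-devel libffi-devel zlib-devel readline-devel sqlite-devel
curl -O https://www.python.org/ftp/python/3.10.9/Python-3.10.9.tgz
tar xzf Python-3.10.9.tgz
cd Python-3.10.9
./configure --enable-optimizations
sudo make altinstall    # altinstall avoids clobbering the system python3
python3.10 --version    # should report 3.10.9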
How to install Docker Compose V2 as CLI plugin on Amazon Linux 2:
mkdir -p /usr/local/lib/docker/cli-plugins
curl -SL https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-linux-x86_64 -o /usr/local/lib/docker/cli-plugins/docker-compose
chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
ls -l /usr/local/lib/docker/cli-plugins/docker-compose
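Quick check that Docker picks up the plugin:

docker compose version
# Docker Compose version v2.15.1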
That gets Docker to where the stable-diffusion-webui-docker repo needs it to be. So, then download and spin up A1111 (the auto profile):
$ git clone https://github.com/AbdBarho/stable-diffusion-webui-docker.git
$ cd stable-diffusion-webui-docker
$ docker compose --profile download up --build
$ docker compose --profile auto up --build
Think we're going to soon have a mountable volume of vetted models. See https://github.com/ManyHands/hypnowerk/issues/56#issuecomment-1420120131.
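A sketch of how such a volume might get wired in on EC2 (device name, mount point, and the repo's data-directory convention are all assumptions here):

sudo mkfs -t xfs /dev/xvdf       # first use of the EBS volume only
sudo mkdir -p /mnt/models
sudo mount /dev/xvdf /mnt/models
# stable-diffusion-webui-docker keeps models under ./data, so point
# that at the shared volume instead of re-downloading per instance
# (assumes ./data does not already exist):
cd stable-diffusion-webui-docker
ln -s /mnt/models data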
stable-diffusion-webui-docker says:
You have to use one of AWS's GPU-enabled VMs and their Deep Learning OS images. These have the right drivers, the toolkit and all the rest already installed and optimized. #70
If we were not using one of the Amazon Deep Learning AMIs, we could build our own AMI which would have Nvidia's GPU-in-Docker tools installed: Setting up NVIDIA Container Toolkit.
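Per NVIDIA's docs, on AL2 that boils down to adding their yum repo and installing the toolkit (repo URL as of this writing; untested here):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)   # yields "amzn2"
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
sudo systemctl restart docker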
This is the Docker image which is the single-user SD workstation, i.e. a private instance with a GPU and:
Etymology: