clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

Add support for GPU instances on AWS #27

Open willprice opened 3 years ago

willprice commented 3 years ago

Currently, launching GPU instances on AWS does not provision the VMs with the drivers needed to use the GPUs. It would be good to have some documentation for people who wish to use CitC in this manner. I plan to work on this today and hope to submit some PRs with instructions.

colinsauze commented 3 years ago

For my first go at a GPU image build, I added the following to compute_image_extra.sh. The image has just finished building, but nvidia-smi is complaining about drivers.

    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo dnf clean all
    sudo dnf -y module install nvidia-driver:latest-dkms
    sudo dnf -y install cuda
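When nvidia-smi complains like this, a small check can help narrow down whether the driver packages are missing or the kernel module simply was not built. The following is only a sketch: the function name `check_gpu_driver` is made up for illustration, and it assumes `dkms` and `rpm` are available on the image.

```shell
#!/bin/bash
# Hypothetical helper (not part of CitC): report why nvidia-smi might fail.
check_gpu_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: driver packages are probably not installed"
        return 1
    fi
    if ! nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi failed: the kernel module may not be built or loaded"
        # Show the DKMS build state and whether headers for the running kernel exist.
        dkms status || true
        rpm -q "kernel-devel-$(uname -r)" || true
        return 1
    fi
    echo "GPU driver OK"
}
```

Running `check_gpu_driver` on a compute node prints one of the three messages above, plus the DKMS and kernel-devel state when the driver is installed but not working.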

willprice commented 3 years ago

As noted by @colinsauze, it is also necessary to increase the size of the image. This can be achieved by adding the following

    launch_block_device_mappings {
        device_name = "/dev/sda1"
        volume_size = 40
    }

to the end of the source "amazon-ebs" "aws" section in /etc/citc/packer/all.pkr.hcl

willprice commented 3 years ago

It is also necessary to install kernel-devel before installing the NVIDIA drivers so that the DKMS module can be built; without it, the build will fail.
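Putting the ordering together, the additions to compute_image_extra.sh might look roughly like the sketch below. The `run` helper, the `DRY_RUN` variable and the `install_gpu_drivers` function name are hypothetical and only there so the sequence can be printed without touching the system; the dnf commands themselves are the ones from earlier in this thread, with kernel-devel moved to the front.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_gpu_drivers() {
    # Kernel headers go first so DKMS can build the NVIDIA module.
    run sudo dnf -y install kernel-devel
    run sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    run sudo dnf clean all
    run sudo dnf -y module install nvidia-driver:latest-dkms
    run sudo dnf -y install cuda
}
```

Calling `install_gpu_drivers` at the end of compute_image_extra.sh runs the steps for real; `DRY_RUN=1 install_gpu_drivers` just lists them in order.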

willprice commented 3 years ago

Docs are being updated at https://github.com/willprice/docs/blob/aws-nvidia-instructions/source/running.rst#aws-gpu-nodes

willprice commented 3 years ago

Once https://github.com/clusterinthecloud/docs/pull/17 is merged, this can be closed.