I-GUIDE / CI_Platform

iGUIDE CI Platform Deployment
Apache License 2.0

GPU Integration #3

Open fbaig opened 9 months ago

fbaig commented 9 months ago

Problem

Larger (especially ML/AI) workloads require access to GPUs for parallel processing. During summer school 2023, GPU instances were made available on the platform, backed by Anvil and JetStream GPU nodes. However, keeping these instances running burns a lot of SUs (service units). At the same time, it's difficult to assign a single GPU instance to each user.

Potential Solution

Pull Requests ToDo ...

alexandermichels commented 9 months ago

We have a very simple test case working on Keeling with CyberGIS-Compute (https://github.com/alexandermichels/geoai-model-testing.git), but this isn't a full solution. Matching software versions exactly between the software stack and the HPC system is going to be difficult. It's possible that we could install software stacks with NVIDIA support through CVMFS, but CVMFS with Compute is still at a very experimental stage. Open to any suggestions.

Edit: the repo I linked has a Dockerfile from an undergrad, but it is not the one we are currently using. This is our current image: https://github.com/cybergis/docker-images/tree/main/cybergis-compute/pytorch

rkalyanapurdue commented 9 months ago

It might be useful to look into NVIDIA NGC containers. We have them deployed on Anvil via the module system. I think there might be corresponding Docker images we can adapt for Jupyter integration.

However, these default images did not seem to work for the summer school team 6 use case, so that might need another look.

fbaig commented 9 months ago

Thread-1: Direct access to GPU (led by @nosolls). Launch a GPU instance on the fly, only when a user requests it.
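
A minimal sketch of the on-demand idea, assuming each user session runs as its own Kubernetes pod and that the cluster runs the NVIDIA device plugin, which exposes GPUs as the `nvidia.com/gpu` resource. The function name, image names, and pod naming below are hypothetical placeholders, not taken from the platform's code:

```python
def build_pod_spec(username, want_gpu,
                   cpu_image="example/notebook:cpu",   # hypothetical image name
                   gpu_image="example/notebook:gpu"):  # hypothetical image name
    """Return a Kubernetes Pod manifest (as a dict) for one user session.

    A GPU is requested only when the user explicitly asks for one, so
    CPU-only sessions never occupy a GPU node.
    """
    container = {
        "name": "notebook",
        "image": gpu_image if want_gpu else cpu_image,
        "resources": {"limits": {}},
    }
    if want_gpu:
        # Resource name advertised by the NVIDIA device plugin.
        container["resources"]["limits"]["nvidia.com/gpu"] = 1
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"jupyter-{username}"},
        "spec": {"containers": [container]},
    }
```

If the cluster is able to add GPU nodes on demand, a pod built with `want_gpu=True` would simply wait until such a node is available, so no GPU node has to stay up while idle. Whether that is feasible on Anvil or JetStream is an open question for this thread.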

fbaig commented 9 months ago

Thread-2: GPU Access with CyberGIS-Compute (led by @alexandermichels)

yirugi commented 9 months ago

2/26 Hacking session:

Modified Terraform configs for JetStream K8s to add NVIDIA GPU support

Created Dockerfile for Jupyter Notebook + NVIDIA + CVMFS
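
A quick way to check from inside a notebook whether the GPU is actually visible to the container could look like the sketch below. It assumes the image ships `nvidia-smi` on the `PATH`; the helper name `list_gpus` is made up for illustration and is not part of the Dockerfile mentioned above:

```python
import shutil
import subprocess


def list_gpus():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    # No nvidia-smi binary usually means no NVIDIA driver is mounted.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        # nvidia-smi exists but failed (e.g. driver not loaded) or timed out.
        return []
    # One GPU name per non-empty output line.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Calling `list_gpus()` in a notebook cell returns an empty list on a CPU-only node, which makes it easy to tell a scheduling problem apart from a problem in the image itself.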

rkalyanapurdue commented 4 months ago

Relies on #17; we will need to upgrade the Kubernetes version before we can use Terraform and Ansible to attach a new GPU node to the cluster. Older versions of Kubernetes are no longer available from the apt repos.