@seanlaw hey Sean - hope you had a wonderful weekend!
Scouting for my next contribution and this one looks interesting! What would next steps look like?
@joehiggi1758 This one is less about code contribution and more about "what are the concrete steps for getting access to a GPU so that we can execute our GPU unit tests?"
If you'd like to help us answer this question and come up with a plan then that would be super helpful. Frankly, it's not entirely clear what is being offered above and whether or not it is even useful for what we need. It could very well be a "nope, it's not quite what we need" and we move on. So this is a fact finding mission.
@seanlaw ahh okay got it - so more of an open ended task at this point, to formalize a plan!
I'm happy to help here and will get us a plan/framework put together!
Awesome! Thank you for your willingness to take on this ill-formed/ambiguous task
@seanlaw hope you're having a wonderful evening! Here's a high level plan/write up on the above! Please let me know if this is not the direction you hoped for and I'm happy to pivot!
High-Level Overview: Quansight and MetroStar's "Open GPU Server" is an initiative designed to provide GPU resources for continuous integration purposes, particularly for the conda-forge community.
Regarding hardware available...
- `open-gpu-server/TOS.md`
- `open-gpu-server/access/conda-forge-users.json`
```yaml
name: CI with GPU
on: [push, pull_request]
jobs:
  build:
    runs-on: gpu_large  # Specification of GPU runner
    ...
```
5. Open a STUMPY pull request
6. Merge GitHub Actions workflow to main
Thanks @joehiggi1758! Did anything in the TOS catch your eye that might be problematic?
From a hardware standpoint, I think `gpu_tiny` or, at most, `gpu_medium` should be sufficient for our needs. We're primarily interested in testing the GPU code. I'm thinking that we add a `.github/workflows/gpu.yml` workflow that performs the GPU unit tests only (via `./test.sh gpu`).
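Something like the following sketch is what I have in mind (the `gpu_tiny` runner label comes from the Open GPU Server hardware tiers above; the checkout/setup/install steps are assumptions until we confirm how the runners are provisioned):

```yaml
# .github/workflows/gpu.yml -- a sketch; the runner label and setup steps are assumptions
name: GPU Tests
on: [push, pull_request]
jobs:
  gpu-unit-tests:
    runs-on: gpu_tiny  # hypothetical Open GPU Server runner label
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install STUMPY and test dependencies
        run: python -m pip install -e . pytest
      - name: Run the GPU unit tests only
        run: ./test.sh gpu
```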
@seanlaw of course - happy to help!
The two sections that caught my eye were...
The first statement reads to me as, "we can capture some data about your use of our GPUs, store it, and anonymize it," and the second reads to me as, "we reserve the right to work with third-party vendors, whose terms/risks you agree to by using our GPUs."
I believe both of these statements to be fine, and relatively standard for our purposes, but I'm not sure if you play by different rules about exposing data or agreeing to terms relating to external entities, given that you're under the TD Ameritrade/Schwab umbrella.
Also - that makes sense on the hardware front; should we open an issue to write that `.yml`?
@joehiggi1758 I agree. Aside from GitHub (password) secrets, everything else is open information since we are fully open source. I think we are okay to move forward.
As you wrote above, I think the first step is to "agree" and add STUMPY to `open-gpu-server/TOS.md`. Would you mind doing this first (feel free to tag me in that PR/issue)? After we get the green light, we can then come back to the GitHub workflow. How does that sound?
@seanlaw of course - I'm on it!
@seanlaw we've been merged into main for access to Quansight's GPUs!
Want me to open an issue for a GPU workflow?
> Want me to open an issue for a GPU workflow?
Open an issue or a PR? Are there examples where others have done this successfully? Is there an intermediate step that might allow us to test things out (i.e., test out our access)?
My concern is that we'll need to make a bunch of PRs here in this repo in order to test (rather than a single PR or maybe a couple) and I'd like to avoid that if possible.
@seanlaw hey Sean - hope you had a wonderful weekend!
As a plan of attack: first, to assist with access testing, I have requested to be added to Quansight's open GPU server here; second, I will test access locally and let you know what I find out! Does that work?
> Does that work?
Sounds good!
FWIW, that access list is only for conda-forge repositories, not general usage. So far we haven't offered access to the resources outside conda-forge.
@jaimergp Can you further explain what that means and what we can/can't do? STUMPY has a conda-forge feedstock, but that only gets triggered when we bump the latest PyPI version. What we'd like to do is run our GPU unit tests as a new PR/commit comes in.
Ah, sorry, I didn't see any mentions of conda-forge in this ticket so I incorrectly assumed you were trying to add the server CI directly here, not in your stumpy feedstock. Apologies.
When you modify the recipe in the feedstock, you can add the necessary tests, but there's no need to run the whole suite.
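For example, a minimal test section in the recipe could look like this (a sketch; the GPU test file name and `tests/` layout are assumptions, not STUMPY's confirmed structure):

```yaml
# Sketch of a conda-forge recipe test section (recipe/meta.yaml);
# the GPU test file name and tests/ layout are assumptions
test:
  source_files:
    - tests          # copy the test suite from the source tarball
  requires:
    - pytest
  imports:
    - stumpy         # smoke test: the package imports cleanly
  commands:
    - pytest tests/test_gpu_stump.py  # run only the GPU-specific tests
```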
@jaimergp No need to apologize at all, and we appreciate your help! From your description, it sounds like adding the recipe to the conda feedstock would mean that the underlying package, STUMPY, has already been published to PyPI? Our current process is:
However, we would like to run our GPU tests in Step 1 as new changes/commits (to our GPU code) occur and NOT after a new version is released to PyPI (by that time, it is too late to catch any GPU bugs/errors).
Maybe I'm misunderstanding the point of accessing this GPU resource? What is the primary use case?
Exactly, the GPU resources are only available during (4). The primary use case is redistribution QA: making sure we have compiled things in the right way and asserting they would install and work correctly on end users' machines.
For day-to-day development I'm afraid our server is insufficient to meet the general demands. You may look into https://docs.gha-runners.nvidia.com/ or the other solutions discussed at https://github.com/zarr-developers/zarr-python/issues/2041.
Thanks for confirming @jaimergp and for sharing alternative options! We will need to investigate if this is worth it. We don't have any funding so "free" is what we are looking for.
Closing this for now and may revisit in the future.
At the SciPy 2024 conference, I learned that free GPU runners are available via Quansight's/MetroStar's "Open GPU Server". We may consider using this in the future.