@seanlaw hey Sean - hope you had a wonderful weekend!
Scouting for my next contribution and this one looks interesting! What would next steps look like?
@joehiggi1758 This one is less about code contribution and more about "what are the concrete steps for getting access to a GPU so that we can execute our GPU unit tests?"
If you'd like to help us answer this question and come up with a plan then that would be super helpful. Frankly, it's not entirely clear what is being offered above and whether or not it is even useful for what we need. It could very well be a "nope, it's not quite what we need" and we move on. So this is a fact finding mission.
@seanlaw ahh okay got it - so more of an open ended task at this point, to formalize a plan!
I'm happy to help here and will get us a plan/framework put together!
Awesome! Thank you for your willingness to take on this ill-formed/ambiguous task
@seanlaw hope you're having a wonderful evening! Here's a high level plan/write up on the above! Please let me know if this is not the direction you hoped for and I'm happy to pivot!
High-Level Overview: Quansight and MetroStar's "Open GPU Server" is an initiative designed to provide GPU resources for continuous integration purposes, particularly for the conda-forge community.
Regarding hardware available...
- `open-gpu-server/TOS.md`
- `open-gpu-server/access/conda-forge-users.json`
```yaml
name: CI with GPU
on: [push, pull_request]
jobs:
  build:
    runs-on: gpu_large  # Specification of GPU runner
    ...
```
5. Open a STUMPY pull request
6. Merge GitHub Actions workflow to main
Thanks @joehiggi1758! Did anything in the TOS catch your eye that might be problematic?
From a hardware standpoint, I think `gpu_tiny` or, at most, `gpu_medium` should be sufficient for our needs. We're primarily interested in testing the GPU code. I'm thinking that we add a `.github/workflows/gpu.yml` workflow that performs the GPU unit tests only (via `./test.sh gpu`).
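Something like the following sketch is what I have in mind (the `gpu_tiny` runner label comes from the Open GPU Server hardware tiers above; the checkout/setup/install steps are assumptions until we confirm how the runners are provisioned):

```yaml
# .github/workflows/gpu.yml -- a sketch; the runner label and setup steps are assumptions
name: GPU Tests
on: [push, pull_request]
jobs:
  gpu-unit-tests:
    runs-on: gpu_tiny  # hypothetical Open GPU Server runner label
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install STUMPY and test dependencies
        run: python -m pip install -e . pytest
      - name: Run the GPU unit tests only
        run: ./test.sh gpu
```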
@seanlaw of course - happy to help!
The two sections that caught my eye were...
The first statement reads to me as, "we can capture some data about your use of our GPUs, store it, and anonymize it," and the second reads to me as, "we reserve the right to work with third-party vendors, whose terms/risks you agree to by using our GPUs."
I believe both of these statements to be fine, and relatively standard for our purposes, but I'm not sure if you play by different rules about exposing data or agreeing to terms relating to external entities, given that you're under the TD Ameritrade/Schwab umbrella.
Also - that makes sense on the hardware front; should we open an issue to write that `.yml`?
@joehiggi1758 I agree. Aside from GitHub (password) secrets, everything else is open information since we are fully open source. I think we are okay to move forward.
As you wrote above, I think the first step is to "agree" and add STUMPY to `open-gpu-server/TOS.md`. Would you mind doing this first (feel free to tag me in that PR/issue)? After we get the green light, we can then come back to the GitHub workflow. How does that sound?
@seanlaw of course - I'm on it!
@seanlaw we've been merged into main for access to Quansight's GPUs!
Want me to open an issue for a GPU workflow?
> Want me to open an issue for a GPU workflow?
Open an issue or a PR? Are there examples where others have done this successfully? Is there an intermediate step that might allow us to test things out (i.e., test out our access)?
My concern is that we'll need to make a bunch of PRs here in this repo in order to test (rather than a single PR or maybe a couple) and I'd like to avoid that if possible.
@seanlaw hey Sean - hope you had a wonderful weekend!
As a plan of attack: first, to assist with access testing, I have requested to be added to Quansight's open GPU server here; second, I will test access locally and let you know what I find out! Does that work?
> Does that work?
Sounds good!
FWIW, that access list is only for conda-forge repositories, not general usage. So far we haven't offered access to the resources outside conda-forge.
@jaimergp Can you further explain what that means and what we can/can't do? STUMPY has a conda-forge feedstock, but that only gets triggered when we bump the latest PyPI version. What we'd like to do is run our GPU unit tests as a new PR/commit comes in.
Ah, sorry, I didn't see any mentions of conda-forge in this ticket so I incorrectly assumed you were trying to add the server CI directly here, not in your stumpy feedstock. Apologies.
When you modify the recipe in the feedstock, you can add the necessary tests, but there's no need to run the whole suite.
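For example, a minimal test section in the recipe could look like this (a sketch; the GPU test file name and `tests/` layout are assumptions, not STUMPY's confirmed structure):

```yaml
# Sketch of a conda-forge recipe test section (recipe/meta.yaml);
# the GPU test file name and tests/ layout are assumptions
test:
  source_files:
    - tests          # copy the test suite from the source tarball
  requires:
    - pytest
  imports:
    - stumpy         # smoke test: the package imports cleanly
  commands:
    - pytest tests/test_gpu_stump.py  # run only the GPU-specific tests
```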
@jaimergp No need to apologize at all, and we appreciate your help! From your description, it sounds like adding the recipe to the conda feedstock would mean that the underlying package, STUMPY, has already been published to PyPI? Our current process is:
However, we would like to run our GPU tests in Step 1 as new changes/commits (to our GPU code) occur and NOT after a new version is released to PyPI (by that time, it is too late to catch any GPU bugs/errors).
Maybe I'm misunderstanding the point of accessing this GPU resource? What is the primary use case?
Exactly, the GPU resources are only available during (4). The primary use case is redistribution QA: making sure we have compiled things in the right way and asserting they would install and work correctly on end users' machines.
For day-to-day development I'm afraid our server is insufficient to meet the general demands. You may look into https://docs.gha-runners.nvidia.com/ or the other solutions discussed at https://github.com/zarr-developers/zarr-python/issues/2041.
Thanks for confirming @jaimergp and for sharing alternative options! We will need to investigate if this is worth it. We don't have any funding so "free" is what we are looking for.
Closing this for now and may revisit in the future.
At the SciPy 2024 conference, I learned that free GPU runners are available via Quansight's/MetroStar's "Open GPU Server". We may consider using this in the future.