Quansight / open-gpu-server

The Open GPU Server for CI purposes.

Add conda-forge-admin to allowed users #17

Open carterbox opened 9 months ago

carterbox commented 9 months ago

To obtain access to the CI server, you must complete the form below:

If conda-forge-admin will be opening PRs/migrations on feedstocks that use this CI, then shouldn't it also have permission to use this CI?

https://github.com/conda-forge/libmagma-feedstock/pull/14

jakirkham commented 9 months ago

Thanks Daniel! 🙏 Good point

Are there other bots we would want to add @beckermr ?

@jaimergp what do you think about adding bots here?

beckermr commented 9 months ago

This is all we want for now. We'll need to figure out how the server interacts with GitHub apps.

jaimergp commented 9 months ago

Hm, it might be a gateway for any user to trigger the CI without actually having signed the TOS, so I'm not sure we should authorize bots. We might have to accept that as a limitation, where an authorized user needs to push an empty commit right after the bot has done its thing.

jakirkham commented 9 months ago

Interesting, that might affect automerge as well, since that is another bot commit

Maybe we can mull on this one some more and see what options we can come up with?

jaimergp commented 9 months ago

Yep, of course. Not saying this will never be possible.

There might be solutions we can implement at the trigger validation level, but right now, with what we have, since any GitHub user can trigger a bot command (re-render, add maintainer, update version, etc.), adding conda-forge-admin as a valid user is the same as allowing everyone.

We will have to document the automation limitations so people are not super surprised.

PS: I do hope that folks do not rely on automerge for 6h+ builds though hah.

jakirkham commented 9 months ago

If there was a good way to have users' own authentication used with bots, we might be able to register things like re-rendering commits or automerge commits with users

That said, this may involve work on the conda-forge side, the Quansight side, and maybe talking to GitHub to work through issues that come up

Maybe that would be one path


As to 6hr+ builds and automerge, interestingly long-running builds are almost best managed with automerge. Long-running builds are slow to get feedback from and can be forgotten about while doing other work. It was one of the things that prompted the automerge addition in conda-forge

Though with such long running builds we may want to think about scheduling as well (or I guess some things can stay pending for a while)

One more thing to think about I guess 🙂

beckermr commented 9 months ago

We'd have to ask users to install a conda app in order to act on their behalf. An easier approach might be to use one of our bots only for certain actions that would trigger this CI and then gate when we use this bot.

jakirkham commented 9 months ago

Ok cool we have a couple ideas

Are there any others we have not yet considered?

dharhas commented 9 months ago

Also remember we need to gatekeep things from a resource-constraint POV as well; this is a single server with a small number of available runners and no ability to scale.

jakirkham commented 9 months ago

This is good to keep in mind Dharhas

For more context on the bot cases mentioned above: these are cases where a maintainer takes an action and a bot assists them. The maintainer could take these actions themselves. Also, the action taken will simply start a set of jobs equivalent to a single commit on a single repo (so no combinatorial explosion)

Hope that helps clarify things. Happy to help answer any questions 🙂

beckermr commented 9 months ago

A maintainer who has not agreed to the TOS could automerge a PR by another maintainer who has, if we add the bots.

jakirkham commented 9 months ago

Think that falls under the authorization discussion above (and how we handle that)

Understood Dharhas' concern to be about resource oversubscription, which is orthogonal

Under the current conda-forge model don't see automerge or re-render presenting this sort of risk

A migrator could be a source of risk (unless we regulate how many PRs it opens)

Also a maintainer pushing individual commits in rapid succession

Does that make more sense?

To this end what sort of throttling logic does the CI have? Is autocancellation available (so only the latest change keeps running)?

jaimergp commented 9 months ago

Is autocancellation available (so only the latest change keeps running)?

Yes, via the concurrency settings in the GHA workflow.
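
Roughly, a minimal sketch of what those concurrency settings look like in a GHA workflow (the workflow/job names here are illustrative, not the actual feedstock config):

```yaml
# Illustrative only: auto-cancel an in-flight run when a newer commit lands on the same PR/branch.
name: gpu-ci
on:
  pull_request:

concurrency:
  group: gpu-ci-${{ github.ref }}   # one concurrency group per branch/PR
  cancel-in-progress: true          # only the latest change keeps running

jobs:
  build:
    runs-on: ubuntu-latest          # placeholder; real jobs would target the Cirun GPU runner labels
    steps:
      - run: echo "GPU build would go here"
```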

A migrator could be a source of risk (unless we regulate how many PRs it opens)

Yes, but it's not a risk if we ask maintainers to push an empty commit to trigger the extra CI when they do need it. I am assuming that the GPU server will not be the only CI provider in most feedstocks, so debugging can happen in the regular Azure runners and the GPU ones are enabled only when it's certain they will work. This assumption might be wrong, but tbh it might be the only natural way given the current limitation of 4 concurrent GPU jobs.

carterbox commented 9 months ago

I think the way that CuPy protects its CI is that jobs always have to be manually started by commenting on the PR with a specific command, e.g. `/test [type of test]`.

Requiring an authorized user to write `conda-forge-admin, please run ci` in a comment for every job that is queued here from a PR would be the most conservative and intuitive method, I think. Having to push an empty commit seems like friction?

jakirkham commented 9 months ago

We've done similar things in RAPIDS and Dask for running CI

It can still cause friction ofc, but I think our goal is merely to minimize friction (not necessarily eliminate it, given our constraints). To that end, I agree a comment is less friction

jaimergp commented 9 months ago

We can probably add this in the GHA workflow directly, and then have Cirun allow the pull request comment trigger. Better than an empty commit, I agree.
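
For example, a rough sketch of how that comment gate could look in the GHA workflow, assuming the `issue_comment` trigger plus an `author_association` check (the trigger phrase and names here are illustrative, not a settled design):

```yaml
# Illustrative sketch only: run GPU CI when an authorized maintainer asks for it in a PR comment.
name: gpu-ci-on-request
on:
  issue_comment:
    types: [created]

jobs:
  gpu-build:
    # Only react to PR comments that contain the trigger phrase and come from
    # users with write access (OWNER/MEMBER/COLLABORATOR author associations).
    if: >-
      github.event.issue.pull_request &&
      contains(github.event.comment.body, 'please run ci') &&
      contains(fromJSON('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.comment.author_association)
    runs-on: ubuntu-latest           # placeholder; the real job would target the Cirun GPU runner labels
    steps:
      # Note: issue_comment workflows run against the default branch, so checking
      # out the PR head would need an explicit ref lookup.
      - run: echo "GPU build would go here"
```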