dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.57k stars 718 forks source link

Stop GPU CI bot #5154

Closed mrocklin closed 2 years ago

mrocklin commented 3 years ago

@charlesbluca FYI the GPU CI bot is currently pinging every PR. I recommend that we change this bot so that it doesn't get this involved in day-to-day development.

charlesbluca commented 3 years ago

Would you prefer that we switch the bot implementation such that gpuCI only runs on PRs with a specific comment (something like run gpu tests)? Or only that we disable it for now while we are sorting out issues in #5147?

mrocklin commented 3 years ago

I'm happy leaving what triggers the bot to you all (I think that you have more experience here than I do).

However I am going to add the constraint that GPU CI not interfere with normal non-GPU development. Having a comment on every PR can be confusing to new devs and generally distracting.

On Mon, Aug 2, 2021 at 6:41 AM Charles Blackmon-Luca < @.***> wrote:

Would you prefer that we switch the bot implementation such that gpuCI only runs on PRs with a specific comment (something like run gpu tests)? Or only that we disable it for now while we are sorting out issues in

5147 https://github.com/dask/distributed/pull/5147?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dask/distributed/issues/5154#issuecomment-891037054, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACKZTAYH45YRE5TMMEVEL3T22N7DANCNFSM5BMX252A .

quasiben commented 3 years ago

Ideally we'd like this to run for everyone PR. I don't know how to distinguish between CPU and GPU code, For example, if an author changes something in dask/array we'd like to run that PR against gpuCI to ensure we are still support CuPy or at least fully aware and accept a breaking change.

The bot is only pinging on PR authors who are not publicly listed in the dask org. All Dask Admins can add those not publicly listed or in the Dask org to run on gpuCI with add to allowlist in a comment. Would you be willing to let this experiment go on for another day or two while we sort out some of these kinks ?

mrocklin commented 3 years ago

Yeah, taking time is welcome.

The bot is only pinging on PR authors who are not publicly listed in the dask org

Dask is maybe different from RAPIDS in that most PRs are written by people who are not in our organization. As a result, these messages maybe make less sense. Maybe people should be allowed by default and only not-allowed if they're added to a disallow-list?

On Mon, Aug 2, 2021 at 7:12 AM Benjamin Zaitlen @.***> wrote:

Ideally we'd like this to run for everyone PR. I don't know how to distinguish between CPU and GPU code, For example, if an author changes something in dask/array we'd like to run that PR against gpuCI to ensure we are still support CuPy or at least fully aware and accept a breaking change.

The bot is only pinging on PR authors who are not publicly listed in the dask org. All Dask Admins can add those not publicly listed or in the Dask org to run on gpuCI with add to allowlist in a comment. Would you be willing to let this experiment go on for another day or two while we sort out some of these kinks ?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dask/distributed/issues/5154#issuecomment-891061924, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACKZTBF67VR4TACQBWQQKLT22RVLANCNFSM5BMX252A .

quasiben commented 3 years ago

I don't think we want a disallow-list as we want to prevent folks from doing nasty things on the GPU. The github ci bot acts similarly -- that is, maintainers must approve first time contributors. After approval those authors, when they submit PRs, use Github actions like everyone else. The same would be true for gpuCI (after this initial burn in phase).

Looking at the last year of authors there are 55 contributors, there are 80 people in the dask org, and I assume a fair amount of overlap between authors this year and folks in the Dask org .

I would expect these message to be infrequent after the next day or two. Still, if you are thinking this is too noisy we'll try something else like a comment

quasiben commented 3 years ago

@jrbourbeau had the great idea of adding in the authors in the past year for dask/distributed to the allowlist to help reduce noise. This has now been done

mrocklin commented 3 years ago

+1

jrbourbeau commented 3 years ago

Thanks @quasiben! That should help cut down on a lot of the current noise

The github ci bot acts similarly -- that is, maintainers must approve first time contributors

I wonder if there's some permission we could give the gpuCI webhook that would let us hook into GitHub actions first-time contributor approval. That way there wouldn't be any additional notification or check beyond what GitHub already does

quasiben commented 3 years ago

I don't think there is but I will look and also talk with OPs folks. If there is not, would you like us to switch to commenting on PRs to trigger gpuCI rather than testing each PR ?

quasiben commented 3 years ago

OPs said that something custom could be built but it would be challenging to get this done.

@mrocklin I want to check in again. While things were a little were a little noisy on the first day, to me it seems like the noise has been significantly reduced. I did see James approve someone yesterday. There is a PR up to further explain gpuCI along with text informing folks on Dask's Github Actions usage.

Are we ok closing this issue ? Do you want us to reevaluate the model we have for gpuCI ?

jrbourbeau commented 3 years ago

Just following up here after discussing this in the latest monthly community meeting. Generally folks seem okay with the current gpuCI model, but would like to improve the comment the bot leaves for those not on the allowlist (xref https://github.com/dask/dask/pull/7985#issuecomment-893111874). @quasiben has asked NVIDIA OPs folks about our options for this (xref https://github.com/dask/dask/pull/7985#issuecomment-893500783).

quasiben commented 3 years ago

I believe gpuCI is using a Jenkins Plugin and the comment setting is global so it would be used for dask and RAPIDS projects. How do folks feel about:

Can one of the admins verify this pull request to run on gpuCI?

cc @GenevieveBuckley

GenevieveBuckley commented 3 years ago

I believe gpuCI is using a Jenkins Plugin and the comment setting is global so it would be used for dask and RAPIDS projects. How do folks feel about:

Can one of the admins verify this pull request to run on gpuCI?

cc @GenevieveBuckley

Are you asking for better suggestions? How about:

If this pull request needs GPU testing run, can one of the admins leave a comment saying "whatever the magic words are"?

Some points:

  1. What "verify" means is super ambiguous to me. It's easier to answer the question "does this PR need GPU testing?", so I'd like the message to reflect that more.
  2. You need to include whatever the magic phrase is to get gpuCI to run. It's worth assuming people will see this message only irregularly, and might not remember. I have no idea what the magic phrase is supposed to be: at one point it was "add to allowlist", over here Julia wrote "ok to test". I also don't know if making a typo, or including extra text about other stuff in the same comment would cause the command not to be recognised. So the bot's message should include what you need the admin to say in response.
  3. The last thing is that people need to know whether they are an admin for GPU permissions or not. Is this obvious? Do all the maintainers have these permissions, or just some of them? I would expect anyone with merge rights to a repository should also be able to trigger the gpuCI (so if this is not the case, we need to make it super clear somewhere what the deal is)
GenevieveBuckley commented 3 years ago

@quasiben that Jenkins plugin you link to has a note at the top saying "You should probably migrate to GitHub Branch Source Plugin". I think it might be obsolete?

Here's the link to the GitHub Branch Source plugin they suggest instead: https://plugins.jenkins.io/github-branch-source/

quasiben commented 3 years ago

I'm more asking if the above was appropriate or good enough. Because the message is global and shared with RAPIDS I think we are looking for something short and to the point. The above was the suggestion from the OPs team running/maintaining gpuCI

What "verify" means is super ambiguous to me. It's easier to answer the question "does this PR need GPU testing?", so I'd like the message to reflect that more.

This is slightly challenging as authors don't know what PRs may or may not need testing . This is came up more frequently in the past (less so now) and as such we can run every PR on gpuCI. So I think it's less a question of does it need GPU testing to will this PR break gpu testing. By running every PR on gpuCI we don't have to answer this question. I agree that verify is ambiguous -- vouch is a bit too formal, can you think of a better word here ?

You need to include whatever the magic phrase is to get gpuCI to run. It's worth assuming people will see this message only irregularly, and might not remember. I have no idea what the magic phrase is supposed to be: at one point it was "add to allowlist", over here Julia wrote "ok to test". I also don't know if making a typo, or including extra text about other stuff in the same comment would cause the command not to be recognised. So the bot's message should include what you need the admin to say in response.

We tried to address this with updated documentation in https://github.com/dask/dask/pull/7985 which has since been merged in: https://docs.dask.org/en/latest/develop.html#gpu-ci. We can ask OPs to include the possible answers here but again we need to share the messaging

The last thing is that people need to know whether they are an admin for GPU permissions or not. Is this obvious? Do all the maintainers have these permissions, or just some of them? I would expect anyone with merge rights to a repository should also be able to trigger the gpuCI (so if this is not the case, we need to make it super clear somewhere what the deal is)

This is not obvious. The list is manually maintained and is currently limited to a mix of dask maintainers and RAPIDS engineering:

charlesbluca
quasiben
pentschev
jacobtomlinson
jrbourbeau
jsignell
martindurant
jcrist
jakirkham
mrocklin
TomAugspurger

I think we can also make this clear with documentation and Dask/RAPIDS folks can act liaisons to adding additional members as gpuCI admins. What do you think ?

jakirkham commented 3 years ago

Maybe we can add the admin users to a GitHub team? Perhaps that would make it more transparent.

GenevieveBuckley commented 3 years ago

The info in the docs helps, thank you!

charlesbluca commented 3 years ago

Maybe we can add the admin users to a GitHub team? Perhaps that would make it more transparent.

Definitely in favor of doing this, considering the current group that is pinged for gpuCI access issues (dask/gpu) has a few people in it who aren't actually admins

jsignell commented 3 years ago

I would love to have the appropriate response included in the message. I've always appreciated how conda-forge does that.

jakirkham commented 3 years ago

I would love to have the appropriate response included in the message. I've always appreciated how conda-forge does that.

This gets tricky (even in conda-forge) as it can be accidentally parsed as approval to do the action, which requires a fair bit of cleverness to avoid. We (conda-forge) have gone through this pain firsthand. I don't really want to push others to figure that out (RAPIDS & Dask included)

jsignell commented 3 years ago

Ah I definitely did not appreciate that. Ok I am fine with things as they stand then.

GenevieveBuckley commented 3 years ago

I didn't realize that either. Ok, well so long as it's written down in the developer docs, and maintainers know where to look up that info again - that should work out fine.

quasiben commented 2 years ago

I think this has since been resolved -- closing