dask / community

For general discussion and community planning. Discussion issues welcome.
20 stars 3 forks source link

Possible Release Today ? #201

Closed quasiben closed 2 years ago

quasiben commented 2 years ago

This morning @pentschev detected a critical bug around UCX and connection handling after https://github.com/dask/distributed/pull/5487 was merged in. How would folks feel if we reverted https://github.com/dask/distributed/pull/5487 and pushed out a release today, before the serialization PRs went in ?

cc @jrbourbeau @fjetter @jakirkham

fjetter commented 2 years ago

@jrbourbeau is out today. I have no idea who else is capable of doing a release (I am not). If this is broken I suggest a revert regardless of the release. If it's on main we can even do a non-HEAD release with this revert and will not block any other PRs.

quasiben commented 2 years ago

I believe @jakirkham can do a release but will not be around until PST wakes up. @pentschev can you submit a PR to revert https://github.com/dask/distributed/pull/5487 ?

If John is able to, @fjetter are you ok with us push a release out ?

@martindurant @jsignell should they have comments as well

pentschev commented 2 years ago

I don't expect this to block Distributed's CI, so far I was only able to reproduce this with 4+ GPUs, and gpuCI is limited to single-GPU only today.

pentschev commented 2 years ago

Opened https://github.com/dask/distributed/pull/5505 to revert the PR now.

fjetter commented 2 years ago

If John is able to, @fjetter are you ok with us push a release out ?

I'm not thrilled about this but I won't resist. I'm just on the fence since there are so many critcial stability problems on main lately and we're still going through with releases. UCX stuff affects a smaller subset of people even. However, critical bugfixes are worth a hotfix release and that shouldn't hurt anyone

I can't help but wonder if this could've been avoidable if we changed our release process, e.g. do not release anything that hasn't been on main for X days or if there are any external tests we should consult before releasing? see also #200

quasiben commented 2 years ago

I'm not thrilled about this but I won't resist. I'm just on the fence since there are so many critcial stability problems on main lately and we're still going through with releases. UCX stuff affects a smaller subset of people even. However, critical bugfixes are worth a hotfix release and that shouldn't hurt anyone

I agree with this generally. This week however is when RAPIDS is starting its burndown cycle and we are pinning to the latest Dask release. I think this is one where we probably should have tested more on our end before merging in.

pentschev commented 2 years ago

I can't help but wonder if this could've been avoidable if we changed our release process, e.g. do not release anything that hasn't been on main for X days ...

Not release process-wide, but this could have been avoided if we had multi-GPU testing in CI, which is challenge I've been trying to keep up with by running my own "multi-GPU CI" nightly.

... or if there are any external tests we should consult before releasing?

I think this is one where we probably should have tested more on our end before merging in.

Again, this is a problem with the limitation in CI. In a perfect world I would love to have multi-GPU CI available, but unfortunately this isn't a reality today. Thus, I'm trying to keep up with those issues myself, but lately I've been working on fixing so many UCX issues (not all in Dask/Distributed) that this one ended up being stationed in my queue for a few days before getting to debug and realize the root of the problem, apologies for the trouble I caused here.

fjetter commented 2 years ago

apologies for the trouble I caused here.

Don't worry. I'm just wondering how we can improve for the future.

douglasdavis commented 2 years ago

Benjamin Zaitlen @.***> writes:

@martindurant @jsignell should they have comments as well

Just chiming in to say @martindurant is out this week

jsignell commented 2 years ago

I'm ok with reverting and releasing. But I agree with @fjetter that once this has been solved, we should follow up about how to fix the release system more holistically (I commented over at #200).

quasiben commented 2 years ago

Thanks all. I've merged in dask/distributed#5487 and @jakirkham will be starting the release process shortly

jakirkham commented 2 years ago

Tagged and uploaded to PyPI. Will work on getting out conda-forge packages now.

jakirkham commented 2 years ago

conda-forge packages are up. Went a bit slower than expected as there were some CDN issues.

quasiben commented 2 years ago

Thanks @jakirkham for getting this in. And thank you folks for the quick responses here. I'll close this issue

jrbourbeau commented 2 years ago

Looks like I picked a bad day to be out of the office : ) Thank you to everyone here for handling this