Closed quasiben closed 2 years ago
@jrbourbeau is out today. I have no idea who else is capable of doing a release (I am not). If this is broken I suggest a revert regardless of the release. If it's on main we can even do a non-HEAD release with this revert and will not block any other PRs.
I believe @jakirkham can do a release but will not be around until PST wakes up. @pentschev can you submit a PR to revert https://github.com/dask/distributed/pull/5487 ?
If John is able to, @fjetter are you ok with us push a release out ?
@martindurant @jsignell should they have comments as well
I don't expect this to block Distributed's CI, so far I was only able to reproduce this with 4+ GPUs, and gpuCI is limited to single-GPU only today.
Opened https://github.com/dask/distributed/pull/5505 to revert the PR now.
If John is able to, @fjetter are you ok with us push a release out ?
I'm not thrilled about this but I won't resist. I'm just on the fence since there are so many critcial stability problems on main lately and we're still going through with releases. UCX stuff affects a smaller subset of people even. However, critical bugfixes are worth a hotfix release and that shouldn't hurt anyone
I can't help but wonder if this could've been avoidable if we changed our release process, e.g. do not release anything that hasn't been on main for X days or if there are any external tests we should consult before releasing? see also #200
I'm not thrilled about this but I won't resist. I'm just on the fence since there are so many critcial stability problems on main lately and we're still going through with releases. UCX stuff affects a smaller subset of people even. However, critical bugfixes are worth a hotfix release and that shouldn't hurt anyone
I agree with this generally. This week however is when RAPIDS is starting its burndown cycle and we are pinning to the latest Dask release. I think this is one where we probably should have tested more on our end before merging in.
I can't help but wonder if this could've been avoidable if we changed our release process, e.g. do not release anything that hasn't been on main for X days ...
Not release process-wide, but this could have been avoided if we had multi-GPU testing in CI, which is challenge I've been trying to keep up with by running my own "multi-GPU CI" nightly.
... or if there are any external tests we should consult before releasing?
I think this is one where we probably should have tested more on our end before merging in.
Again, this is a problem with the limitation in CI. In a perfect world I would love to have multi-GPU CI available, but unfortunately this isn't a reality today. Thus, I'm trying to keep up with those issues myself, but lately I've been working on fixing so many UCX issues (not all in Dask/Distributed) that this one ended up being stationed in my queue for a few days before getting to debug and realize the root of the problem, apologies for the trouble I caused here.
apologies for the trouble I caused here.
Don't worry. I'm just wondering how we can improve for the future.
Benjamin Zaitlen @.***> writes:
@martindurant @jsignell should they have comments as well
Just chiming in to say @martindurant is out this week
I'm ok with reverting and releasing. But I agree with @fjetter that once this has been solved, we should follow up about how to fix the release system more holistically (I commented over at #200).
Thanks all. I've merged in dask/distributed#5487 and @jakirkham will be starting the release process shortly
Tagged and uploaded to PyPI. Will work on getting out conda-forge packages now.
conda-forge packages are up. Went a bit slower than expected as there were some CDN issues.
Thanks @jakirkham for getting this in. And thank you folks for the quick responses here. I'll close this issue
Looks like I picked a bad day to be out of the office : ) Thank you to everyone here for handling this
This morning @pentschev detected a critical bug around UCX and connection handling after https://github.com/dask/distributed/pull/5487 was merged in. How would folks feel if we reverted https://github.com/dask/distributed/pull/5487 and pushed out a release today, before the serialization PRs went in ?
cc @jrbourbeau @fjetter @jakirkham