dask / community

For general discussion and community planning. Discussion issues welcome.

Release 2021.10.0 #189

Closed: jrbourbeau closed this issue 2 years ago

jrbourbeau commented 2 years ago

With our usual two-week release cadence, we would normally release tomorrow. However, @fjetter has been able to identify several fixes that should resolve the cluster deadlock issue flagged right before the last release (xref https://github.com/dask/community/issues/182), which was also reported by another user in https://github.com/dask/distributed/issues/5366.

I'd prefer to bump the 2021.10.0 release to next Friday, October 15, to give us time to get those fixes in and confirm they indeed resolve https://github.com/dask/distributed/issues/5366.

Additionally, since the last release we've merged a large refactor to the distributed worker state machine (xref https://github.com/dask/distributed/pull/5046) which resulted in some follow-up work (https://github.com/dask/distributed/pull/5316). @fjetter @crusaderky, checking in, is there any other follow-up work related to the worker state refactor we should prioritize before releasing, or are we okay on that front?

@quasiben you mentioned on the community call earlier today that RAPIDS is in a code freeze and a release is planned for today. I suspect this means bumping the dask + distributed release back a week is fine, but wanted to double check.

cc @jakirkham @jsignell

jakirkham commented 2 years ago

Thanks James! 😄

Not seeing any issue with postponing to next Friday, but will make sure others know and will raise any concerns here.

For RAPIDS 21.10, which is coming out soon, we are pinning to Dask + Distributed 2021.9.1, so the Dask + Distributed release shouldn't impact it.

Ben please feel free to correct me on any of this 🙂

crusaderky commented 2 years ago

> @fjetter @crusaderky, checking in, is there any other follow-up work related to the worker state refactor we should prioritize before releasing, or are we okay on that front?

None on my side

jrbourbeau commented 2 years ago

We've merged all current deadlock-related PRs. They seem to have helped with the deadlock reported offline but unfortunately don't fully resolve the issue (a subsequent deadlock occurred that took longer to trigger). There is also a publicly reported deadlock issue, which we think is similar to the offline report (xref https://github.com/dask/distributed/issues/5366). I've commented at https://github.com/dask/distributed/issues/5366#issuecomment-944460518 to ask whether the reporter is still encountering cluster deadlocks with the latest main branch of distributed.

To avoid releasing a version of distributed that is known to deadlock, I suggest we push the release back another week. @jakirkham @quasiben, is there any issue with this on your end?

jakirkham commented 2 years ago

Makes sense. Thanks for the update James. No issues on our end :)

fjetter commented 2 years ago

Investigating the deadlocks led me to https://github.com/dask/distributed/pull/5426, which seems to fix the problems originally reported by a power user.

There is still an open issue about a deadlock, but it appears to already affect the current stable version, 2021.9.1: https://github.com/dask/distributed/issues/5366

jrbourbeau commented 2 years ago

Thanks for all your efforts on this, @fjetter. Since https://github.com/dask/distributed/issues/5366 was already reported against a released version of distributed, I suggest we merge https://github.com/dask/distributed/pull/5426, which is a known improvement, and then release.

jakirkham commented 2 years ago

@pentschev mentioned this morning that some changes were causing issues for us in RAPIDS. Peter, are all of those fixed, or are there outstanding issues we should address before releasing?

pentschev commented 2 years ago

They were just some minor changes in https://github.com/dask/distributed/pull/5438 and https://github.com/dask/distributed/pull/5446 that changed default/previous behavior and therefore broke our tests. Nothing critical, and both have been addressed in https://github.com/rapidsai/dask-cuda/pull/757 and https://github.com/rapidsai/dask-cuda/pull/758, respectively. So, as far as I'm aware, we're good. Thanks @jakirkham for the ping.

jrbourbeau commented 2 years ago

Thanks @jakirkham for flagging those dask-cuda issues and @pentschev for resolving them downstream.

https://github.com/dask/distributed/pull/5426 has been merged so I think we're in good shape to push out the 2021.10.0 release. I'll plan to push the release out in around an hour and will, as usual, ping this issue when I start that process.

jakirkham commented 2 years ago

Sounds good. Have mentioned it internally as well. Will let you know if anything comes up; otherwise, I think we should be good to go.

jrbourbeau commented 2 years ago

🚀