Deduplicate scheduler requests in P2P

hendrikmakait commented 1 month ago

In P2P, a worker may send many concurrent requests to the scheduler if it is not aware of a specific shuffle run. This could lead to DDoS-like scenarios on large-scale clusters with P2P rechunking. This PR prohibits concurrent fetch requests.

[ ] Tests added / passed
[ ] Passes pre-commit run --all-files

jacobtomlinson commented 1 month ago

Last week @phofl mentioned that it would be great to get more reviews on PRs as that is currently a point of friction for folks actively working on Dask/Distributed.

However I find it hard to dive into reviewing a PR like this as it doesn't close an open issue and it doesn't have a description of what the change does or any examples of what this fixes. I'd love to help out more with reviews here, but I don't know where to begin with this one.

cc @fjetter

hendrikmakait commented 1 month ago

@jacobtomlinson, it looks like you wrote this just as I added a description. Is this helpful? (Sorry for having it "ready for review" without a description, lost my internet connection for a couple of minutes so the update to the description didn't go out in sync with the change in readiness.)

github-actions[bot] commented 1 month ago

Unit Test Results

_See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests._

25 files ± 0 25 suites ±0 10h 16m 54s :stopwatch: + 3m 10s 4 131 tests + 1 4 019 :white_check_mark: ±0 110 :zzz: ±0 2 :x: +1 47 718 runs +10 45 620 :white_check_mark: +8 2 096 :zzz: +1 2 :x: +1

For more details on these failures, see this check.

Results for commit 2257deb9. ± Comparison against base commit fa9806b5.

jacobtomlinson commented 1 month ago

Fair enough. I'd been digging through the code for 10 mins at the point you updated the description. This also isn't a one off, see https://github.com/dask/distributed/pull/8809, https://github.com/dask/distributed/pull/8852, https://github.com/dask/distributed/pull/8834 and https://github.com/dask/distributed/pull/8898 for some more recent examples.

I think my point is that I'd like to help out more with reviews, but it can be challenging.

fjetter commented 1 month ago

@jacobtomlinson The examples you're listing are surprising for me

https://github.com/dask/distributed/pull/8809 An updated warning message. I admit that the title is confusing but I don't know if this needs a more thorough description
https://github.com/dask/distributed/pull/8852 is a precommit update. There's nothing more to say, is there?
https://github.com/dask/distributed/pull/8834 has a decent title and a short description. Change is a hard coded config value
https://github.com/dask/distributed/pull/8898 I guess this is legit. This does not include any context for why this is useful but I also don't expect anybody else to dive right into task queuing

I understand some criticism about the first (bad title) and last (lacking context) but the others are fine, aren't they?

jacobtomlinson commented 1 month ago

That's fair, I just grepped through the last couple of pages of distributed looking for PRs with little/no context. I appreciate some of them may be trivial PRs, but I expect if I went digging I would find more examples like https://github.com/dask/dask-expr/pull/1146 or https://github.com/dask/dask-expr/pull/1138.

I'm very happy to help out with PR review if that's something that would be valuable, but folks outside of the Coiled team have a lot less context about why things are being done, so it would be really helpful to have more descriptive PRs.

fjetter commented 1 month ago

My sense is that generally we do provide decent descriptions. I don't really want to dig through the past and discuss bad issues or PRs now. I don't think this is the reason why few people are helping out. I'm happy to revisit this if this is a persistent issue

jacobtomlinson commented 1 month ago

I don't really want to dig through the past and discuss bad issues or PRs now

Thats totally fair

I don't think this is the reason why few people are helping out

I'm trying to feed back to you all that it's one of the reasons why I help out less than I used to. I hope you can be receptive to that. I'm also trying to communicate that I want to help more.

Sorry that this turned into such a big deal. My intent was to gently nudge so that I can get more involved with reviews. But it's blown up a bit so apologies for any bad feeling generated from this discussion.

hendrikmakait commented 1 month ago

I'm trying to feed back to you all that it's one of the reasons why I help out less than I used to. I hope you can be receptive to that. I'm also trying to communicate that I want to help more.

It's noted and appreciated. To me, this seems like your usual chicken-and-egg problem. With little external engagement around reviews or code contributions, there's little time for writing good descriptions (and less value in doing so). With bad descriptions, there's little external engagement.

Let's try to get to a minimal standard where we write something that's helpful but also doesn't take up much time. Personally, I'd say: Feel free to ping people on PRs you'd like to review but lack sufficient context. We'll just have to find a good middle ground here so that we don't blow the effort for documentation out of proportion.

jacobtomlinson commented 1 month ago

Personally, I'd say: Feel free to ping people on PRs you'd like to review but lack sufficient context.

This is a good suggestion!

dask / distributed

Deduplicate scheduler requests in P2P #8899

Unit Test Results