dask / dask-blog

Dask development blog
https://blog.dask.org/

Change GPU scheduler article #164

Closed · mrocklin closed this 1 year ago

mrocklin commented 1 year ago

This article https://blog.dask.org/2023/04/14/scheduler-environment-requirements includes statements like the following:

If you use value-add hardware on the client and workers such as GPUs you’ll need to ensure your scheduler has one

This statement can be misleading. It's very true for RAPIDS work, but generally less true for PyTorch or other GPU work. (here is a pretty typical example).
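For concreteness, here is a minimal sketch of that task-scheduling pattern (this is not the linked example; the cluster address, model, and data are made up): the GPU is only touched inside the task function, which imports torch lazily on a GPU worker, so nothing GPU-specific ever needs to be deserialized on the scheduler.

```python
from dask.distributed import Client

def train_on_gpu(shard):
    # Lazy import inside the task, so it only runs on a GPU worker
    import torch

    model = torch.nn.Linear(10, 1).cuda()
    x = torch.tensor(shard, dtype=torch.float32).cuda()
    loss = model(x).sum()
    # Return a plain Python float so nothing GPU-backed leaves the worker
    return float(loss.item())

client = Client("tcp://scheduler:8786")  # address is illustrative
futures = client.map(train_on_gpu, [[1.0] * 10, [2.0] * 10])
print(client.gather(futures))
```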

I've fielded a bunch of questions on this topic. Here is an example. I think that we should alter this blog post to talk more about how serialization works. This article is causing non-trivial confusion among general, non-RAPIDS, GPU users.

cc @jacobtomlinson @quasiben @ntabris

mrocklin commented 1 year ago

Just to give more color here, at least two engineers within Coiled read this article and thought "oh, the scheduler has to have a GPU now or else things won't work". That resulted in them telling other engineers and some users they had to rewrite their code when they actually didn't. I found myself having to un-say the things said in this article, even to fairly sophisticated engineers who know a lot about how things work.

jacobtomlinson commented 1 year ago

It was a super hard line to walk between being hand-wavey and saying "you might need a GPU on the scheduler in more situations than previously" and being super rigid and saying "You MUST have a GPU on your scheduler at all times". @rjzamora, @fjetter and I debated this a bunch and settled on giving firm best practice guidance that if you have workers/clients with GPUs you should ensure you have a small GPU on your scheduler.

It felt better to give strong guidance and signal that folks can go off-road if they like, but it sounds like we failed at that. Of course folks can run the scheduler without a GPU, in the same way that they could have a totally different version of NumPy on the scheduler. We just wanted to clearly signal that here be dragons. Generally, if you're going to the effort of running GPUs on your client and workers, it feels like an over-optimization to run the scheduler without a GPU. It doesn't have to be an A100; a T4 will guard against the majority of serialization bugs we will face with the recent changes.

I would love to see more content around this coming from non-RAPIDS folks, and if someone wants to go back in and revise this post, I'd be happy to review it. Alternatively, we could expand the documentation around this or create a follow-on post about the PyTorch task-scheduling use case you've highlighted here, where you don't need a GPU on the scheduler.

fjetter commented 1 year ago

From a UX perspective I think it's much easier to tell people to keep the hardware and software aligned, since the way we're doing serialization is not straightforward and I doubt the ordinary user understands it. Serialization issues are hard for non-expert users, and hard to communicate, particularly to novice users. I also have to admit that even I was one of the confused engineers Matt is talking about.

I get that the language in the blog post is very strong. I'm OK with rolling this back a little to something like "strongly recommended". However, I think it requires a bit of domain knowledge to make this work (lazy imports, no data serialized, etc.), and most users will be better off just aligning hardware.
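As a rough illustration of the contrasting case (the cluster address and arrays below are hypothetical, and this is only a sketch of the kind of workload being discussed): when GPU-backed objects such as cupy arrays are created on the client and embedded in the task graph, the scheduler may need to deserialize them, which is where aligned hardware and software help.

```python
import cupy as cp
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # address is illustrative

# Created in client GPU memory, then embedded in the graph as a task argument
weights = cp.arange(10, dtype=cp.float32)

def scale(x, w):
    # Runs on a GPU worker; both arguments arrive as cupy arrays
    return float((x * w).sum())

# Both cupy arrays travel inside the graph the client submits, so the
# scheduler may need to be able to deserialize GPU-backed objects.
future = client.submit(scale, cp.ones(10, dtype=cp.float32), weights)
print(future.result())
```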

mrocklin commented 1 year ago

From a UX perspective I think it's much easier to tell people to keep the hardware and software aligned, since the way we're doing serialization is not straightforward and I doubt the ordinary user understands it. Serialization issues are hard for non-expert users

I don't object to this as a general encouragement. We would want it to be broader than just the GPU scheduler conversation though, and more about packages in general, I think.

This blogpost is very focused on "you need GPU schedulers now" which is wrong, or at least commonly wrong. The only update that occurred affects people who used to have a GPU client and GPU workers but not a GPU scheduler. That seems fairly rare to me, and specific, I think, only to RAPIDS workloads. However, many non-RAPIDS folks saw this blogpost and got the wrong idea.

My current thinking is that we should remove this article until something better gets put up. If folks want to message more about the virtues of greater consistency I'm +1 on that. Right now my sense is that this article is doing more harm than good.

mrocklin commented 1 year ago

Proposed PR moving it back to draft state until someone improves the article: https://github.com/dask/dask-blog/pull/165

jacobtomlinson commented 1 year ago

This blogpost is very focused on "you need GPU schedulers now" which is wrong, or at least commonly wrong.

I would push back pretty hard on this. I would argue that the majority of Dask GPU users are using libraries like cupy and cudf, where this advice is pretty important. I agree that folks using Dask with PyTorch may be able to avoid this, but that's likely to be the minority case.

My current thinking is that we should remove this article until something better gets put up.

Reactions like this remove basically all my motivation to engage here.

mrocklin commented 1 year ago

but that's likely to be the minority case.

We likely see different populations.

Reactions like this remove basically all my motivation to engage here

I'm confused by this. We're publishing a thing that gives incorrect advice. I think that it makes sense to revert the thing that we've published until we have something that is correct. I view this article in the same way that I would view a bug that was introduced. I'm fielding the downstream effects of the bug and I want the pain to stop. If the person who introduced the bug can fix the bug, then great. If the person doesn't have time to fix the bug, that's OK; we'll just revert.

Based on your reaction here my sense is that you see it differently. I'm sorry if my response offended you in some way. This is, to me, the normal response to a situation like this.

jacobtomlinson commented 1 year ago

I view this article in the same way that I would view a bug that was introduced.

I view this article as a PR that fixed one bug but introduced another. Reverting still causes pain by reintroducing the problem.

I've opened #166 to start the conversation around how we can roll forward onto something that fixes both problems.