Distributed core logic maintenance

Vesyrak commented 10 months ago

Hey all,

I was wondering how much of the core logic of the entire deploy part is still maintained. I've called out one issue (https://github.com/dask/distributed/issues/8119) and created several PR's ( https://github.com/dask/distributed/pull/8428, https://github.com/dask/distributed/pull/8271, https://github.com/dask/distributed/pull/8428, https://github.com/dask/distributed/pull/8273) which have gotten zero attention, which is too bad as I would like to fix the issues upstream (or if I'm using it incorrectly, fix my issues). This, combined with git-blames which show barely any recent relevant changes make me wonder how much time is invested here.

As we are currently building upon your distributed framework, I would like to hear your vision and opinions of your focus on the follow-up for distributed deployment and optimisation. This to prevent building something which might eventually be phased-out.

mrocklin commented 10 months ago

Hi, the code is certainly maintained, but probably folks are less excited about pure code rearrangements. There's a lot of activity going on and not a ton of people to do the work.

Some ways to get reviewers time might include motivating the work a bit more by talking about what specific problems you're running into and why they're important. While reducing code complexity is valuable, any change this deep in the system is likely to have knock on effects, so you're asking people to both deprioritize what they're working on currently, and also sign up for a couple days in the future cleaning up whatever gets unearthed by such a change. Speaking for the folks whose priorities I influence I'll say that I'm so far less excited by these proposed changes than by their current priorities (query optimization, task ordering rewriting parquet). I could be wrong though (hopefully we can find better and more important work) but it probably takes some effort to make that argument though.

Regardless, my apologies that no one responded, even if to say "sorry, busy". Certainly you're owed that.

Cc'ing also @jacobtomlinson who I think cares a lot about maintaining a welcoming environment.

Vesyrak commented 10 months ago

Hey, thanks for the fast response!

I wholeheartedly agree that there should be focus on the more "pressing" issues, and that this can be moved to the back. The PR's I defined certainly aren't going to change the world for the better, as they're mainly based on my current findings. The question mainly arose due to the fact that I'm doing some resource optimisation on top of your distributed engine, and have confused myself multiple times with some of the distributed implementation details, caused by broken documentation, confusing naming and inconsistent APIs. Which, ofcourse, aren't public APIs, but still. Most complexities are internally monkeypatched, but as I generally dislike this, I'd like to contribute upstream, and not have to keep a fork synchronized.

I agree that we should also be wary for unforeseen consequences, but this should be resolved with more system tests. Distributed systems are generally very error-prone and have long-reaching effects, and its exactly because of this a case can be made for more tests. I would be happy to add some to the proposed PRs, given a bit of direction.

I appreciate the apology, but I want to apologise myself if I came on too strong as I did not intend to ask for an apology, just for a frame of reference for the future of distributed's deploy functionality.

Kind regards,

jacobtomlinson commented 9 months ago

Firstly I want to say sorry that we dropped the ball on reviewing your PRs, having them sit without attention for so long isn't a good contributor experience. I appreciate that you weren't asking for an apology, but one is definitely warranted.

I had a chat with @fjetter earlier about these PRs specifically and he and I are the code owners of the things you are proposing changes to. There seems to be a general theme with these PRs that makes them a little hard to review so I wanted to mention a few high level things here:

It's not clear why some of these changes are beneficial. In a project as large as Dask we tend to need measurable reasons to motivate code changes. Is something broken? Is something performing badly? When one of us hits merge on an external contribution we own the consequences of that breaking things, which happens a lot. So sometimes we need a little persuasion that a change is worth making.
Tests give us confidence in our code. None of your PRs add any tests and seem to result in existing tests failing. Ideally we should be able to easily describe the problem we are trying to solve as a test that ensures it has been solved and that it doesn't get unsolved in the future.
Dask historically doesn't have a clean separation between public/private APIs. It's something we are working on but right now we tend to treat documented code as public and the rest as private. However if something doesn't start with an _ folks may assume it is public and use it, so we need to be more conscious going forward around introducing new public APIs.

But I'm afraid we just don't have the capacity to coach folks through contributions here right now, so to increase chances of things getting merged I encourage you to spend some time adding tests to your PRs and adding more motivation to the problems they solve.

Vesyrak commented 9 months ago

Hey, thanks for the feedback!

I can see the requirement for better motivating the PR's. The current motivation are generally improving readability/extensibility and small bugfixes, but I will try to elaborate within each PR individually. These generally resolve issues which I have repeatedly encountered when extending the Adaptive class to add more intelligence to the approach of deploying and managing workers. I would like to propose some of our changes to upstream to allow the larger community to benefit from them, following the heart of OSS development. That is, of course, if you think those proposed changes would not be drawbacks for some of the community. But these additions depend on some of the refactors I proposed here, and will be proposed when they are tested sufficiently by us internally.

The tests, to my experience at least, appear to be a bit confusing and/or flaky locally and on github, but I'll take a look at them again, as failing tests should not be allowed in any case indeed.

Nonetheless, my main question (is the deploy functionality still maintained and kept long-term) is answered, and I can close this issue.

Thank you for the collaboration!

dask / distributed

Distributed core logic maintenance #8432