coiled / feedback

A place to provide Coiled feedback

Scheduler dying during task runs #171

Closed: shughes-uk closed this issue 2 years ago

shughes-uk commented 2 years ago

A customer is experiencing the scheduler dying after running tasks successfully for a while (possibly a deadlock).

Example cluster that died https://cloud.coiled.io/julianfb51/clusters/36106/2/details

Another cluster that ran the same number of jobs successfully https://cloud.coiled.io/julianfb51/clusters/32825/details

The user has upgraded pandas between runs.

The user's code runs relatively quickly locally; on the cluster they see a lot of logs spamming task stats, plus scheduler CPU soft locks.

Workers get a 'BrokenPipe' error too.

@phobson

jfb51 commented 2 years ago

Hi @phobson, I'm the user, let me know if I can provide more detail

jfb51 commented 2 years ago

Hi there, any luck so far?

phobson commented 2 years ago

@jfb51 could you provide some information as to how this error comes up? What kind of code are you running? How big is the task graph? Which dask API are you using? What's the failure rate?

dchudz commented 2 years ago

@phobson I think we know what this is about and are working on a fix. Don't need info from jfb51.

ian-r-rose commented 2 years ago

Hi @jfb51, thanks for the report! To try to synthesize Paul and David's replies: I do think we have a fix for the scheduler dying, and we'll get that out ASAP. However, we are interested in what kind of workflows you are running so that we can better understand how they resulted in the scheduler instability. Would you feel comfortable sharing any representative snippets or reproducible code with us?

jfb51 commented 2 years ago

Hi @ian-r-rose, that's excellent, please let me know when I can give that a try! Let me try to pull together a reproducible example.

jfb51 commented 2 years ago

OK, so, embarrassingly I now think this is 100% a problem on my side, but I'm not sure of the root cause.

What I'm doing is running simulations for cricket matches. For a given match, one of the inputs to the simulation is the "career" of each of the players playing in the match (runs scored, matches played, etc. before this game starts).

The career input for a given match is a 22x130 pandas dataframe (there are 2 teams of 11 players in cricket, and 130 attributes which I've collected for each player). One weird thing is that one of these dataframes is apparently 520MB. Weirder is that the 'parent' dataframe containing all matches that I sample from, which is a 95355 x 130 dataframe, is only 440MB. When this was working recently I was passing in dataframes which would be about 440MB/(95355/22) ~ 100KB, so no issue for the scheduler. Now that the size of this input has ballooned to 520MB it's obviously causing problems, but as I say, I don't understand how such a small dataframe can be this big, if you accept my claim that there are no enormous individual elements of the dataframe, just strings, ints, floats, etc.
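
As a side note on why inputs of this size strain the scheduler: anything passed directly as a task argument gets embedded in the task graph and travels through the scheduler on its way to the workers. A minimal sketch of the usual way around that, pre-scattering the inputs with client.scatter so the graph only carries futures (simulate_match is a hypothetical stand-in for the actual simulation function; career_bowling_data and s_and_m are the objects from the snippet further down):

from distributed import Client

client = Client()  # a local cluster for illustration; the thread uses a Coiled cluster

# One small career dataframe per match, as in the snippet below.
career_inputs = [career_bowling_data.xs(m[1], level=1) for m in s_and_m]

# Scatter the dataframes to the workers up front, so the task graph only
# carries lightweight futures instead of the (unexpectedly large) data.
scattered = client.scatter(career_inputs)

# Submit one simulation per match against the scattered inputs.
futures = [client.submit(simulate_match, careers) for careers in scattered]
results = client.gather(futures)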

jfb51 commented 2 years ago

from pympler import asizeof

batters = []
bowlers = []

# m is just a tuple; s_and_m is a list of 107 of these tuples
for m in s_and_m:
    batters.append(career_batting_data.xs(m[1], level=1))
    bowlers.append(career_bowling_data.xs(m[1], level=1))

print(asizeof.asizeof(career_bowling_data))
print(career_bowling_data.shape)
print(asizeof.asizeof(bowlers))
print(asizeof.asizeof(bowlers[0]))
print(bowlers[0].shape)

Output:

440746536
(95355, 130)
7772732248
523128496
(22, 130)
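
One possibility, not confirmed in the thread, is that the .xs cross-section still references objects owned by the parent dataframe (its underlying blocks or the original MultiIndex), and asizeof, which follows references, then counts far more than the 22x130 values themselves. A quick check along those lines, comparing the slice against a copy of itself:

from pympler import asizeof

s = bowlers[0]

# pandas' own accounting of just the 22x130 values, in bytes
print(s.memory_usage(deep=True).sum())

# Deep size as measured by pympler, which follows every referenced object
print(asizeof.asizeof(s))

# The same slice after .copy(), which detaches it from anything it still
# references in the parent dataframe or the original index
print(asizeof.asizeof(s.copy()))

If the copy measures close to memory_usage(deep=True) while the original measures hundreds of megabytes, the bulk of the reported size is in shared references rather than in the slice's own data.
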
jfb51 commented 2 years ago

Clearly not a dask/distributed problem for sure

phobson commented 2 years ago

Interesting. How many dataframes are in, e.g., your bowlers list? Are you concatenating them back into a single dataframe?

phobson commented 2 years ago

hey @jfb51 just checking on this. Any luck yet?

jfb51 commented 2 years ago

Hi @phobson, thanks for the bump. Both the bowlers and batters lists are 89 elements of 22x130 dataframes, and they remain in the lists rather than being concatenated back together at any stage, so I still have no real idea why any one element of these lists is so enormous. I've worked around it by converting these into dicts instead.
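
For reference, a minimal sketch of that kind of workaround, using the names from the earlier snippet (the exact conversion used isn't shown in the thread):

from pympler import asizeof

# Convert each per-match career dataframe into a plain dict of dicts keyed
# by row, so the object shipped to the cluster carries no pandas index or
# block machinery (and no references back to the parent dataframe).
batter_dicts = [df.to_dict(orient="index") for df in batters]
bowler_dicts = [df.to_dict(orient="index") for df in bowlers]

print(asizeof.asizeof(bowler_dicts[0]))  # expected to be far below the 520MB measured above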