Closed: shughes-uk closed this issue 2 years ago
Hi @phobson, I'm the user, let me know if I can provide more detail
Hi there, any luck so far?
@jfb51 could you provide some information as to how this error comes up? What kind of code are you running? How big is the task graph? Which dask API are you using? What's the failure rate?
@phobson I think we know what this is about and are working on a fix. Don't need info from jfb51.
Hi @jfb51, thanks for the report! To try to synthesize Paul and David's replies: I do think we have a fix for the scheduler dying, and we'll get that out ASAP. However, we are interested in what kind of workflows you are running so that we can better understand how they resulted in the scheduler instability. Would you feel comfortable sharing any representative snippets or reproducible code with us?
Hi @ian-r-rose, that's excellent, please let me know when I can give that a try! Let me try to pull together a reproducible example.
OK, so, embarrassingly I now think this is 100% a problem on my side, but not sure of the root cause.
What I'm doing is running simulations for cricket matches. For a given match, one of the inputs to the simulation is the "career" of each of the players playing in the match (runs scored, matches played, etc. before this game starts).
The career input for a given match is a 22x130 pandas dataframe (there are 2 teams of 11 players in cricket, and 130 attributes which I've collected for each player). One weird thing is that one of these dataframes is apparently 520 MB. Weirder still is that the 'parent' dataframe containing all matches that I sample from, which is a 95355 x 130 dataframe, is only 440 MB. When this was working recently I was passing in dataframes which would be about 440 MB / (95355/22) ~ 100 KB, so no issue for the scheduler. Now that the size of this input has ballooned to 520 MB it's obviously causing problems, but as I say I don't understand how such a small dataframe can be this big, if you accept my claim that there are no enormous individual elements in the dataframe, just strings, ints, floats etc.
batters = []
bowlers = []
# m is just a tuple; s_and_m is a list of 107 of these tuples
for m in s_and_m:
    batters.append(career_batting_data.xs(m[1], level=1))
    bowlers.append(career_bowling_data.xs(m[1], level=1))
from pympler import asizeof
print(asizeof.asizeof(career_bowling_data))
print(career_bowling_data.shape)
print(asizeof.asizeof(bowlers))
print(asizeof.asizeof(bowlers[0]))
print(bowlers[0].shape)
440746536
(95355, 130)
7772732248
523128496
(22, 130)
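As a hedged aside from the editor, not part of the original exchange: pympler's asizeof follows every object reachable from what it measures, so a slice that still holds references back to its parent's data can report far more than the data it actually owns. One way to cross-check the numbers above is to compare asizeof against pandas' own memory_usage(deep=True) and against the pickled size, which is closer to what actually gets shipped to the scheduler and workers. The names career_bowling_data and bowlers are taken from the snippet above; report_sizes is a hypothetical helper.

import pickle
from pympler import asizeof

def report_sizes(df, label):
    # memory_usage(deep=True) counts only the data the frame itself owns
    owned = df.memory_usage(deep=True).sum()
    # asizeof recursively counts everything reachable from the object
    reachable = asizeof.asizeof(df)
    # the pickled size approximates what dask actually serializes and sends
    pickled = len(pickle.dumps(df))
    print(f"{label}: owned={owned:,} reachable={reachable:,} pickled={pickled:,} bytes")

report_sizes(career_bowling_data, "parent")
report_sizes(bowlers[0], "one 22x130 slice")

If the "owned" and "pickled" figures are small while the asizeof figure is huge, the slice is most likely dragging shared objects along with it rather than genuinely containing 520 MB of data.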
Clearly not a dask/distributed problem, then.
Interesting. How many dataframes are in, e.g., your bowlers list? Are you concatenating them back into a single dataframe?
hey @jfb51 just checking on this. Any luck yet?
Hi @phobson, thanks for the bump. Both the bowlers and batters lists contain 89 of these 22x130 dataframes, and they stay in the lists rather than being concatenated back together at any stage, so I still have no real idea why any single element of these lists is so enormous. I've worked around it by converting them into dicts instead.
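A minimal sketch of what that dict workaround could look like (the to_dict call and its orientation are the editor's assumption; the thread doesn't show the actual conversion). Converting each slice to plain Python containers drops whatever pandas machinery and shared references the slices were still holding, so the per-match payload stays small.

# convert each 22x130 slice into a plain dict keyed by player
# (orient="index" assumes the player index of each slice is unique)
bowler_dicts = [df.to_dict(orient="index") for df in bowlers]
batter_dicts = [df.to_dict(orient="index") for df in batters]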
A customer is experiencing the scheduler dying after running tasks successfully for a while (possibly a deadlock).
Example cluster that died https://cloud.coiled.io/julianfb51/clusters/36106/2/details
Another cluster that ran the same number of jobs successfully https://cloud.coiled.io/julianfb51/clusters/32825/details
The user has upgraded pandas between runs.
The user's code runs relatively quickly locally. The cluster reports a lot of log spam about task stats, plus scheduler CPU soft locks.
Workers get a 'BrokenPipe' error too.
@phobson
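For triaging reports like this one, a hedged sketch (not taken from the thread): the distributed client can pull recent scheduler and worker logs directly, which helps confirm CPU soft locks or BrokenPipe errors without digging through the cluster UI. The scheduler address below is a placeholder assumption.

from distributed import Client

# attach to the running scheduler (address is a placeholder)
client = Client("tcp://scheduler-address:8786")

# fetch the most recent scheduler log records
print(client.get_scheduler_logs(n=100))

# fetch recent log records from each worker
for worker, lines in client.get_worker_logs(n=100).items():
    print(worker, len(lines), "log records")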