dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Dask Stream Closed Error with Bayesian Optimization #807

Open stevenagajanian opened 7 years ago

stevenagajanian commented 7 years ago

Hi all,

I've been trying to create a backtesting framework for a time series analysis that involves tuning a random forest with Bayesian optimization monthly over the last 10 years. Each month is independent of all the others, so they can be tested at the same time, and I've been trying to use Dask to distribute all of the training/predictions. When run serially without Dask, the job completes with no errors; with Dask, however, I get a stream closed error for some of the months. I've found that the number of months that complete without error is inversely proportional to the number of iterations Bayesian optimization runs (which essentially makes each month run longer): 60 Bayesian optimization iterations per month resulted in 32/132 months completing, and 30 iterations resulted in 64/132. The Bayesian optimization library I'm using is https://github.com/fmfn/BayesianOptimization. Does anyone have any idea why this might be happening? Is this an issue with Dask, some config issue, or an incompatibility with the BO library?
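For context, the fan-out pattern described above can be sketched with the standard library's `concurrent.futures`; Dask's `distributed.Client` exposes an analogous `submit`/`gather` API. This is a minimal illustration, not the actual code from this issue: `backtest_month` and the month list are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def backtest_month(month):
    # Hypothetical stand-in for the real per-month job:
    # tune a random forest with Bayesian optimization, then predict.
    return {"month": month, "score": month * 0.01}

months = list(range(132))  # ~10 years of monthly windows

# Each month is independent, so all can run concurrently.
# With dask.distributed the equivalent would be roughly:
#   futures = [client.submit(backtest_month, m) for m in months]
#   results = client.gather(futures)
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(backtest_month, m) for m in months]
    results = [f.result() for f in as_completed(futures)]

print(f"{len(results)}/132 months completed")
```

In the failing runs described above, only a subset of the equivalent Dask futures ever produced results.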

mrocklin commented 7 years ago

There are a number of stability improvements in https://github.com/dask/distributed/pull/804 that might resolve what you're experiencing. I plan to merge these and issue a micro-release soonish.

In the meantime you might want to try out that branch and see if it has an effect.

pip install git+https://github.com/mrocklin/distributed.git@worker-dep-state --upgrade
stevenagajanian commented 7 years ago

Sorry for the late reply, but I wanted to give an update. I turned off the Bayesian optimization and replaced it with an hour-long sleep. At the end of each of my 132 processes, each one uploads some results to Mongo and prints output to the logs. I saw that the job had theoretically finished, because everything was zeroed out on the Bokeh UI, but when I checked Mongo there were only 58/132 entries, and the output logs didn't show anything they were supposed to. Furthermore, even though all tasks had theoretically finished, I didn't have control over the terminal, as if processes were still running. Do you have any idea why this might be the case? Is there a timeout for each process? If there is a timeout, shouldn't control of the terminal be returned? And would any of this affect output to the logs? Thank you!
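One general way to narrow down where results go missing is to wrap each per-month task so that exceptions come back as data rather than vanishing, which makes it easy to count completed versus failed months. This is a debugging sketch, not code from this thread; `flaky_month` is a hypothetical stand-in:

```python
import traceback

def run_safely(task, *args):
    # Return a crash as data instead of losing the result entirely,
    # so completed vs. failed months can be tallied afterwards.
    try:
        return {"ok": True, "result": task(*args)}
    except Exception:
        return {"ok": False, "error": traceback.format_exc()}

def flaky_month(month):
    # Hypothetical task that fails for every third month.
    if month % 3 == 0:
        raise RuntimeError(f"month {month} failed")
    return month

outcomes = [run_safely(flaky_month, m) for m in range(6)]
completed = sum(o["ok"] for o in outcomes)
print(f"{completed}/6 months completed")  # prints "4/6 months completed"
```

The same wrapper can be submitted to a Dask client in place of the raw task, so a worker-side failure still yields a record in Mongo or the logs.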

mrocklin commented 7 years ago

Can you verify with which version this error occurred?

import distributed
print(distributed.__version__)

If the answer is less than 1.15.1 then you might want to update.

conda install -c conda-forge distributed
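As an aside on checking "less than 1.15.1": comparing version strings lexically can mislead, so splitting into integer tuples is a safe way to do the comparison. A small illustrative sketch:

```python
def version_tuple(v):
    # "1.15.1" -> (1, 15, 1); integer tuples compare numerically.
    return tuple(int(part) for part in v.split("."))

# Naive string comparison gets this wrong ("1" sorts before "9"):
assert "1.15.1" < "1.9.0"
# Tuple comparison gives the intended ordering:
assert version_tuple("1.15.1") > version_tuple("1.9.0")
assert version_tuple("1.15.0") < version_tuple("1.15.1")
```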
stevenagajanian commented 7 years ago

My version is definitely lower; I'll check if I can update. Thank you for the quick reply.

mrocklin commented 7 years ago

Version 1.15.0 introduced some stability issues that 1.15.1 resolved. What you describe is consistent with those issues.

On Fri, Jan 13, 2017 at 1:12 PM, stevenagajanian notifications@github.com wrote:

My version is definitely lower; I'll check if I can update. Thank you for the quick reply.
