apatlpo opened this issue 6 years ago
Can you try starting your client with the following:
client = Client(..., timeout='20s')
and then seeing how the restart does?
For the file issue I'm not surprised to see things not work well on an NFS system. See http://dask.pydata.org/en/latest/setup/hpc.html#no-local-storage
If you have an NFS then maybe just use normal techniques like PYTHONPATH or installing modules rather than use upload_file
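As a sketch of the PYTHONPATH alternative to upload_file (the module directory and the utils.answer function here are made up for illustration): any worker process that inherits the environment variable can import the module directly, with no file shipping involved.

```shell
# Hypothetical module location; any directory visible to the worker nodes works.
libdir=$(mktemp -d)
printf 'def answer():\n    return 42\n' > "$libdir/utils.py"

# Export before launching the workers so they inherit it.
export PYTHONPATH="$libdir:$PYTHONPATH"

# Workers started from this environment can now "import utils" directly.
python3 -c 'import utils; print(utils.answer())'
```

On a PBS system the export would go in the job script, before the mpirun/dask-worker line.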
With 20s:
distributed.client - ERROR - Restart timed out after 20.000000 seconds
With 60s:
distributed.batched - INFO - Batched Comm Closed: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: scheduler='tcp://10.148.0.106:8786' processes=33 cores=132>>
Traceback (most recent call last):
File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 1209, in _run
return self.callback()
File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/client.py", line 846, in _heartbeat
self.scheduler_comm.send({'op': 'heartbeat'})
File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/batched.py", line 106, in send
raise CommClosedError
The message repeats several times.
How are you starting your workers? What kind of system are you on?
I am on a PBS cluster: https://wwz.ifremer.fr/pcdm/Equipement
I run ./launch-dask.sh ${N_WORK_NODES}
following https://pangeo-data.github.io/pangeo/setup_guides/cheyenne.html
It was working fine until a recent update xarray/dask/...
Would you be willing to use git bisect to find the cause? "No" is a fine answer.
I started doing this but am right away confused.
The working version is dask 0.15.4 and distributed 1.19.3
The failing version is dask 0.17.2 and distributed 1.18.1+288.g31c470e
Isn't 31c470e based on the latest distributed version?
Correct. You would only be bisecting in the distributed repository. You would no longer use https://github.com/dask/distributed/commit/31c470e83ca5c66787be151e9c39f1f9b29f523a at all, instead you would bound your search by a commit that you knew worked (like version 1.19.3) and a commit that you knew did not work (like version 1.21.4) and then you would use git bisect to search within that range to find the offending commit. This process is very useful, but can be confusing at first.
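For reference, here is a minimal sketch of the bisect workflow in a throwaway toy repository (the repo, the flag file, and the BAD marker are all invented for illustration; in practice you would bound the search with the known-good v1.19.3 and known-bad v1.21.4 commits of distributed and test each candidate by reinstalling and rerunning your workload):

```shell
# Create a toy repo with 10 commits; the "bug" (the word BAD) appears at commit 7.
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5 6 7 8 9 10; do
  if [ "$i" -ge 7 ]; then echo BAD > flag; else echo OK > flag; fi
  git add flag && git commit -qm "commit $i"
done

# Bound the search: HEAD is known bad, HEAD~9 (the first commit) is known good.
git bisect start HEAD HEAD~9

# Let git run a test on each candidate commit: exit 0 means good, nonzero means bad.
git bisect run sh -c '! grep -q BAD flag'

# The log names the offending commit.
git bisect log | grep "first bad commit"
```

With about 290 commits between the two bounds, bisection needs only around eight test runs.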
Am I supposed to do this simultaneously for dask AND distributed? If yes, I am not sure how I can keep consistent versions of both libraries while bisecting.
Hopefully there is enough flexibility that everything will work smoothly with a single recent version of Dask.
Also, you have no obligation to do this. It would be useful, but if it proves to be too difficult then we can probably find the issue some other way with more time.
Hmm, distributed version 1.19.3 does not seem to be compatible with the latest dask.
The combination does not even produce a scheduler.json.
The following error occurs on the scheduler side:
File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home1/datahome/aponte/distributed/distributed/cli/dask_mpi.py", line 57, in main
services=services)
File "/home1/datahome/aponte/distributed/distributed/scheduler.py", line 376, in __init__
**kwargs)
File "/home1/datahome/aponte/distributed/distributed/node.py", line 37, in __init__
deserialize=deserialize, io_loop=self.io_loop)
File "/home1/datahome/aponte/distributed/distributed/core.py", line 117, in __init__
pc = PeriodicCallback(self.monitor.update, 500, io_loop=self.io_loop)
TypeError: __init__() got an unexpected keyword argument 'io_loop'
Terminated
Argh, this is in fact a distributed-tornado compatibility issue, see #1172
The bisect seems difficult to me.
Unless you see an easier alternative, I'll try a manual bisect with conda install to narrow down where the problem arises.
Versions may not be sufficiently granular to identify the problem. Ideally we would be able to find the exact commit in git where the error was introduced. That may prove to be too difficult though.
I am not sure this is good news, but I find that problems arise as early as dask v0.16.0. Here are the things that change from 0.15.4 to 0.16.0 for my deployments:
- Restarts time out:
distributed.client - ERROR - Restart timed out after 5.000000 seconds
Right after restart, the client output shows 0 workers, but eventually workers come back up.
- client.upload_file('utils.py') breaks with the following error:
distributed.utils - ERROR - [Errno 2] No such file or directory: ...
(same as reported above)
Does this help? How can I diagnose what may be happening?
Going back to the beginning here a bit. Can you verify that your workers actually connect and work normally before you try to restart? Is it just restarting that is the problem?
I believe they do, i.e. I get the right number of workers and I can run computations that do not involve calls to utils.py. The upload_file still breaks, however, but I want to emphasize that I did not try your suggestions about the PYTHONPATH or module-install approach.
I attach a log file for one of the worker (note this is for dask 0.16.0): dask-worker.o965502.log
@rabernat, have you encountered issues (slower responsiveness, restart timeouts, inability to use upload_file) on habanero with dask versions above 0.16.0?
I have reverted to the latest version of dask (0.17.2 / distributed 1.21.4).
Here is the log for one worker: dask-worker.o966682.log
I am not sure what else I could try.
In my experience, restarting workers on a cluster has never actually worked reliably. The minute I have a worker die, I usually just kill the whole cluster and start over. I could never summon the motivation to debug this deeply.
However, your experience sounds worse than what I am used to. A properly configured HPC cluster can perform very well, provided that the workers don't run out of memory or experience errors causing them to die.
This does not sound too optimistic.
I reverted to using the latest distributed (04e7e7a7d6d3cee11ef2a4f2f4d7622d9a21e2eb), which includes #1865.
Here is the behavior that I observe now:
- workers show up in the dask dashboard, but in jupyter notebooks the client tells me that there are no workers
- I have multiple FileNotFoundError errors in worker logs that seem to indicate that workers are deleting each other's work directories. Example:
(pangeon) aponte@datarmor2:~/iwave_sst/datarmor> grep worker-i0oaruup dask*
dask-worker.o967959:FileNotFoundError: [Errno 2] No such file or directory: '/home1/datawork/aponte/dask/worker-i0oaruup/storage'
dask-worker.o967962:distributed.diskutils - WARNING - Found stale lock file and directory '/home1/datawork/aponte/dask/worker-i0oaruup', purging
You might consider not using your home directory for scratch space. I recommend using the --local-directory keyword to dask-worker (however you're launching dask-worker). Some more information here: http://dask.pydata.org/en/latest/setup/hpc.html#no-local-storage
I am using a work directory where I have a comfortable amount of space with the following command:
mpirun --np 7 dask-mpi --nthreads 4 \
--memory-limit 1e10 \
--interface ib0 \
--local-directory ${DATAWORK}/dask \
--scheduler-file=${SCHEDULER}
Ideally this would live on locally attached storage. Network storage isn't particularly good for frequently used scratch space like this, no matter how large it is.
The use of this storage in particular has never been an issue for earlier dask versions (0.15.4, for example). Modifying config.yaml according to
worker-memory-target: false # don't spill to disk
worker-memory-spill: false # don't spill to disk
worker-memory-pause: 0.80 # fraction at which we pause worker threads
worker-memory-terminate: 0.95 # fraction at which we terminate the worker
had no effect.
I thought you had tried older dask versions and seen that it worked fine?
If you have locally attached storage I recommend trying specifying a --local-directory that is not on the network
In general I'm sorry that you're having trouble. I don't know what's going on with your system. I'm trying to propose some things to try but I'm unable to spend a lot of time tracking down what's going on in your case.
I also have no idea if the directories are a problem. I'm only on this because you suggested above that they were clobbering each other. That is not terribly surprising when running on slow NFS systems that aren't terribly consistent.
Maybe someone else with more experience using Dask on these systems can help.
Regarding the local directory, the last thing I have not tried is using my home directory. I'll give it a shot, despite one of your comments above.
Apart from @rabernat, can you think of anyone else I could ping about this?
You could ask on the pangeo issue tracker. There are a number of HPC users there. https://github.com/pangeo-data/pangeo/issues
I can only support @mrocklin's comments on shared file systems. Always use the --local-directory option to point to server-local storage like /tmp. With PBS you may have an environment variable to use, like TMPDIR or PBS_TMPDIR.
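A small sketch of that idea for the job script (assuming, as is common on PBS systems, that TMPDIR points at node-local storage; the mpirun line is illustrative only and mirrors the invocation shown earlier in this thread):

```shell
# Pick a node-local scratch directory, falling back to /tmp when TMPDIR is unset.
LOCAL_DIR="${TMPDIR:-/tmp}/dask-${USER:-worker}"
mkdir -p "$LOCAL_DIR"
echo "using local scratch: $LOCAL_DIR"

# Illustrative worker launch (not executed here):
#   mpirun --np 7 dask-mpi --nthreads 4 \
#       --local-directory "$LOCAL_DIR" \
#       --scheduler-file="$SCHEDULER"
```

Because the directory is local to each node, workers on different nodes can no longer see (or purge) each other's scratch directories.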
I updated dask/xarray via conda yesterday (to distributed 1.21.4) and started having issues with workers not starting on an HPC system. Research eventually brought me to 31c470e, which I tried. It seems to solve some issues, i.e.:
leads to:
I can work, but I still have some issues when trying to upload files (I realize this may be another issue):
leads to:
The directory /home1/scratch/aponte/dask/worker-oft8kj37/ does indeed not exist. When I look for oft8kj37 in the log files I get the following. The last hit may be relevant: