dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

restart time out and upload_file crash #1869

Open apatlpo opened 6 years ago

apatlpo commented 6 years ago

I updated dask/xarray via conda yesterday (to distributed-1.21.4) and started having issues with workers not starting on an HPC system. Research eventually brought me to 31c470e, which I tried. It seems to solve some of the issues.

I can work now, but I still have issues when trying to upload files (I realize this may be a separate issue):

client.upload_file('/home1/datahome/aponte/iwave_sst/hw/utils.py')

leads to:

distributed.utils - ERROR - [Errno 2] No such file or directory: '/home1/scratch/aponte/dask/worker-oft8kj37/utils.py'
Traceback (most recent call last):
  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)

The directory /home1/scratch/aponte/dask/worker-oft8kj37/ does indeed not exist.

When I look for oft8kj37 in log files I get the following. The last hit may be relevant:

iwave_sst/datarmor% grep oft8kj37 dask-*
dask-scheduler.o958818:distributed.core - ERROR - [Errno 2] No such file or directory: '/home1/scratch/aponte/dask/worker-oft8kj37/utils.py'
dask-scheduler.o958818:FileNotFoundError: [Errno 2] No such file or directory: '/home1/scratch/aponte/dask/worker-oft8kj37/utils.py'
dask-worker.o958819:distributed.worker - INFO -       Local Directory: /home1/scratch/aponte/dask/worker-oft8kj37
dask-worker.o958819:distributed.core - ERROR - [Errno 2] No such file or directory: '/home1/scratch/aponte/dask/worker-oft8kj37/utils.py'
dask-worker.o958819:FileNotFoundError: [Errno 2] No such file or directory: '/home1/scratch/aponte/dask/worker-oft8kj37/utils.py'
dask-worker.o958819:distributed.core - ERROR - [Errno 2] No such file or directory: '/home1/scratch/aponte/dask/worker-oft8kj37/utils.py'
dask-worker.o958819:FileNotFoundError: [Errno 2] No such file or directory: '/home1/scratch/aponte/dask/worker-oft8kj37/utils.py'
dask-worker.o958822:distributed.diskutils - WARNING - Found stale lock file and directory '/home1/scratch/aponte/dask/worker-oft8kj37', purging

mrocklin commented 6 years ago

Can you try starting your client with the following:

client = Client(..., timeout='20s')

and then seeing how the restart does?

mrocklin commented 6 years ago

For the file issue I'm not surprised to see things not work well on an NFS system. See http://dask.pydata.org/en/latest/setup/hpc.html#no-local-storage

If you have NFS then maybe just use normal techniques like PYTHONPATH or installing modules rather than upload_file.
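
For example, something along these lines in the PBS job script, before the workers are launched, might work; the path below is the directory containing utils.py from the traceback above, and whether mpirun forwards the variable to remote ranks depends on the MPI implementation (Open MPI needs mpirun -x PYTHONPATH):

# Make utils.py importable on the workers through the NFS-mounted home
# directory, instead of copying it around with upload_file.
export PYTHONPATH="/home1/datahome/aponte/iwave_sst/hw:${PYTHONPATH}"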

apatlpo commented 6 years ago

With 20s:

distributed.client - ERROR - Restart timed out after 20.000000 seconds

With 60s:

distributed.batched - INFO - Batched Comm Closed: in <closed TCP>: BrokenPipeError: [Errno 32] Broken pipe
tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: scheduler='tcp://10.148.0.106:8786' processes=33 cores=132>>
Traceback (most recent call last):
  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 1209, in _run
    return self.callback()
  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/client.py", line 846, in _heartbeat
    self.scheduler_comm.send({'op': 'heartbeat'})
  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/batched.py", line 106, in send
    raise CommClosedError

The message repeats itself several times.

mrocklin commented 6 years ago

How are you starting your workers? What kind of system are you on?

apatlpo commented 6 years ago

I am on a PBS cluster: https://wwz.ifremer.fr/pcdm/Equipement. I run ./launch-dask.sh ${N_WORK_NODES} following https://pangeo-data.github.io/pangeo/setup_guides/cheyenne.html. It was working fine until a recent update of xarray/dask/...

mrocklin commented 6 years ago

It was working fine until a recent update of xarray/dask/...

Would you be willing to use git bisect to find the cause? "No" is a fine answer

apatlpo commented 6 years ago

I started doing this but am right away confused.

The working version is dask 0.15.4 and distributed 1.19.3

The failing version is dask 0.17.2 and distributed 1.18.1+288.g31c470e

Isn't 31c470e based on the latest distributed version?

mrocklin commented 6 years ago

Correct. You would only be bisecting in the distributed repository. You would no longer use https://github.com/dask/distributed/commit/31c470e83ca5c66787be151e9c39f1f9b29f523a at all, instead you would bound your search by a commit that you knew worked (like version 1.19.3) and a commit that you knew did not work (like version 1.21.4) and then you would use git bisect to search within that range to find the offending commit. This process is very useful, but can be confusing at first.
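
Roughly, and assuming the release tags in the distributed repository carry the same names as the versions above, the bisect would look something like this, with each step followed by reinstalling the checked-out commit (for example pip install -e .) and re-running the failing restart:

git bisect start
git bisect bad 1.21.4       # a version known to fail
git bisect good 1.19.3      # a version known to work
# git now checks out a commit roughly halfway between the two; test it,
# mark the result, and let git narrow the range further:
git bisect good             # or: git bisect bad, depending on the test
# repeat until git reports the first bad commit, then clean up:
git bisect reset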


apatlpo commented 6 years ago

Am I supposed to do this simultaneously for dask AND distributed? If yes, I am not sure how I can keep consistent versions of both libraries while bisecting.

mrocklin commented 6 years ago

Hopefully there is enough flexibility that everything will work smoothly with a single recent version of Dask.

Also, you have no obligation to do this. It would be useful, but if it proves to be too difficult then we can probably find the issue some other way with more time.


apatlpo commented 6 years ago

Hmm, distributed version 1.19.3 does not seem to be compatible with the latest dask.

The combination does not even produce a scheduler.json.

The following error occurs on the scheduler side:

  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home1/datahome/aponte/.miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home1/datahome/aponte/distributed/distributed/cli/dask_mpi.py", line 57, in main
    services=services)
  File "/home1/datahome/aponte/distributed/distributed/scheduler.py", line 376, in __init__
    **kwargs)
  File "/home1/datahome/aponte/distributed/distributed/node.py", line 37, in __init__
    deserialize=deserialize, io_loop=self.io_loop)
  File "/home1/datahome/aponte/distributed/distributed/core.py", line 117, in __init__
    pc = PeriodicCallback(self.monitor.update, 500, io_loop=self.io_loop)
TypeError: __init__() got an unexpected keyword argument 'io_loop'
Terminated

apatlpo commented 6 years ago

Argh, this is in fact a distributed/tornado compatibility issue; see #1172.

The bisect seems difficult to me

apatlpo commented 6 years ago

Unless you see an easier alternative, I'll try a manual bisect with conda install to narrow down where the problem arises.

mrocklin commented 6 years ago

Versions may not be sufficiently granular to identify the problem. Ideally we would be able to find the exact commit in git where the error was introduced. That may prove to be too difficult though.


apatlpo commented 6 years ago

I am not sure this is good news, but I find that problems arise as early as dask v0.16.0. Here are the things that change from 0.15.4 to 0.16.0 for my deployments:

Does it help? How can I diagnose what may be happening?

mrocklin commented 6 years ago

Going back to the beginning here a bit. Can you verify that your workers actually connect and work normally before you try to restart? Is it just restarting that is the problem?

apatlpo commented 6 years ago

I believe they do, i.e. I get the right number of workers and I can run computations that do not involve calls to utils.py. upload_file still breaks, however, but I want to emphasize that I have not tried your suggestions about the PYTHONPATH or module-install approach.

I attach a log file for one of the workers (note this is for dask 0.16.0): dask-worker.o965502.log

apatlpo commented 6 years ago

@rabernat, have you encountered issues (slower responsiveness, restart timeouts, inability to use upload_file) on habanero with dask versions above 0.16.0?

apatlpo commented 6 years ago

I have reverted to the latest version of dask (0.17.2 / distributed 1.21.4).

Here is the log for one worker: dask-worker.o966682.log

I am not sure what else I could try.

rabernat commented 6 years ago

In my experience, restarting workers on a cluster has never actually worked reliably. The minute I have a worker die, I usually just kill the whole cluster and start over. I could never summon the motivation to debug this deeply.

However, your experience sounds worse than what I am used to. A properly configured HPC cluster can perform very well, provided that the workers don't run out of memory or experience errors causing them to die.

apatlpo commented 6 years ago

This does not sound too optimistic.

I reverted to using the latest distributed (04e7e7a7d6d3cee11ef2a4f2f4d7622d9a21e2eb), which includes #1865.

Here is the behavior that I observe now:

  • workers show up in the dask dashboard, but in jupyter notebooks the client tells me that there are no workers
  • I have multiple FileNotFoundError errors in the worker logs that seem to indicate that workers are deleting each other's work directories. Example:

(pangeon) aponte@datarmor2:~/iwave_sst/datarmor> grep worker-i0oaruup dask*
dask-worker.o967959:FileNotFoundError: [Errno 2] No such file or directory: '/home1/datawork/aponte/dask/worker-i0oaruup/storage'
dask-worker.o967962:distributed.diskutils - WARNING - Found stale lock file and directory '/home1/datawork/aponte/dask/worker-i0oaruup', purging

mrocklin commented 6 years ago

You might consider not using your home directory for scratch space. I recommend using the --local-directory keyword to dask-worker (however you're launching dask-worker). Some more information here: http://dask.pydata.org/en/latest/setup/hpc.html#no-local-storage


apatlpo commented 6 years ago

I am using a work directory where I have a comfortable amount of space with the following command:

mpirun --np 7 dask-mpi --nthreads 4 \
     --memory-limit 1e10 \
     --interface ib0 \
     --local-directory ${DATAWORK}/dask \
     --scheduler-file=${SCHEDULER}

mrocklin commented 6 years ago

Ideally this would live on locally attached storage. Network storage isn't particularly good for frequently used scratch space like this, no matter how large it is.


apatlpo commented 6 years ago

The use of this storage in particular has never been an issue for earlier dask versions (0.15.4 for example). Modifying config.yaml according to

worker-memory-target: false  # don't spill to disk
worker-memory-spill: false  # don't spill to disk
worker-memory-pause: 0.80  # fraction at which we pause worker threads
worker-memory-terminate: 0.95  # fraction at which we terminate the worker

had no effect.

mrocklin commented 6 years ago

I thought you had tried older dask versions and seen that it worked fine?

If you have locally attached storage I recommend trying to specify a --local-directory that is not on the network.

In general I'm sorry that you're having trouble. I don't know what's going on with your system. I'm trying to propose some things to try but I'm unable to spend a lot of time tracking down what's going on in your case.


mrocklin commented 6 years ago

I also have no idea if the directories are a problem; I'm just on this because you suggested above that they were clobbering each other. This is not terribly surprising when running on slow NFS systems that aren't terribly consistent.


mrocklin commented 6 years ago

Maybe someone else with more experience using Dask on these systems can help.


apatlpo commented 6 years ago

Regarding the local directory, the only thing I have not tried yet is using my home directory. I'll give it a shot, despite one of your comments above.

Apart from rabernat, do you have other people in mind I could ping about this?

mrocklin commented 6 years ago

You could ask on the pangeo issue tracker. There are a number of HPC users there. https://github.com/pangeo-data/pangeo/issues


guillaumeeb commented 6 years ago

I can only support @mrocklin's comments on shared file systems. Always use the --local-directory option to point to server-local storage like /tmp. With PBS you may have an environment variable to use, like TMPDIR or PBS_TMPDIR.
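
For example, adapting the mpirun command from earlier in the thread, and assuming PBS sets TMPDIR to node-local scratch on each compute node (fall back to /tmp otherwise):

# Same launch as before, but with --local-directory pointing at node-local
# storage rather than the shared ${DATAWORK} filesystem.
mpirun --np 7 dask-mpi --nthreads 4 \
     --memory-limit 1e10 \
     --interface ib0 \
     --local-directory "${TMPDIR:-/tmp}/dask-${USER}" \
     --scheduler-file=${SCHEDULER}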