perryhook opened this issue 7 years ago
From the logs above I suspect there may be a version-mismatch issue and yarn is exiting early. To confirm, can you pull the logs from that job and post them?
yarn logs -applicationId application_1491599580352_2891
I'm guessing at a version mismatch between py27 and py3.5 because you are running Anaconda3,
but the env being shipped appears to be 2.7. To force a Python version you can define the packages in the constructor:
YARNCluster(packages=['python=3.5',])
Perhaps this needs to be detected/improved?
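One hedged way to avoid the mismatch entirely is to build the `packages=` spec from the interpreter the client is already running; the `sys.version_info` formatting below is standard, while the `YARNCluster` usage in the comment just mirrors the constructor call above:

```python
import sys

# Build a conda-style package spec matching the running client interpreter,
# e.g. 'python=2.7' or 'python=3.5', so the shipped env cannot drift from
# the client's Python version.
python_spec = "python=%d.%d" % sys.version_info[:2]
print(python_spec)

# Hypothetical usage, mirroring the constructor call above (the cluster
# object itself needs a real YARN cluster to do anything):
# cluster = YARNCluster(packages=[python_spec, 'dask', 'distributed'])
```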
Hmm, well I am using Anaconda3, but the Conda env I'm using is definitely 2.7. But I see a lot of "py36" in the logs. So maybe that is the issue? Does this env need to be set up on every node? Do I have to use Anaconda2?
I've attached the (scrubbed) log you requested: clean_application_1491599580352_2891.log.txt. There's a Python error and stacktrace at the end:
File "./PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58//bin/dask-worker", line 4, in <module>
import distributed.cli.dask_worker
File "/data/sde/hadoop/yarn/local/usercache/me/appcache/application_1491599580352_2891/container_e20_1491599580352_2891_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/__init__.py", line 3, in <module>
from .config import config
File "/data/sde/hadoop/yarn/local/usercache/me/appcache/application_1491599580352_2891/container_e20_1491599580352_2891_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/config.py", line 46, in <module>
ensure_config_file()
File "/data/sde/hadoop/yarn/local/usercache/me/appcache/application_1491599580352_2891/container_e20_1491599580352_2891_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/config.py", line 26, in ensure_config_file
os.mkdir(os.path.dirname(destination))
PermissionError: [Errno 13] Permission denied: '/home/.dask'
I hope that helps! Thanks for helping me out with this.
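For context on that PermissionError: distributed derives its config location by expanding `~`, which on POSIX comes from the HOME environment variable, and inside a YARN container HOME can resolve to a directory like /home that the job user cannot write. A minimal sketch of the mechanism (the helper name is mine, not distributed's API):

```python
import os

def dask_config_path():
    # '~' expands via the HOME env var on POSIX; a YARN container may set
    # HOME to '/home', which is unwritable for the job user -- hence the
    # os.mkdir('/home/.dask') failure in the traceback above.
    return os.path.join(os.path.expanduser("~"), ".dask", "config.yaml")

os.environ["HOME"] = "/home"  # simulate the container environment
print(dask_config_path())     # '/home/.dask/config.yaml'
```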
This permission error has been fixed in master, which has just been released to PyPI. Conda packages are building now.
Dask does expect all of the client, scheduler, and workers to have roughly the same software environment. Python 2 or 3 are both fine, as long as they are consistent.
I'm a bit lost now... Specifying python=2.7 in the packages argument for YARNCluster() did change all the python3.6 stuff I saw in the logs to 2.7, but the behavior was still the same. The knit/YARN application runs and finishes, and future.result() never returns.
I created a new Conda env, installed dask 0.14.2 and distributed, and pip installed dask-yarn, but now I get this:
Exception AttributeError: "'LocalCluster' object has no attribute 'workers'" in <object repr() failed> ignored
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/me/anaconda3/envs/dask_test/lib/python2.7/site-packages/dask_yarn/core.py", line 39, in __init__
self.local_cluster = LocalCluster(n_workers=0, scheduler_port=0, ip=ip)
File "/home/me/anaconda3/envs/dask_test/lib/python2.7/site-packages/distributed/deploy/local.py", line 100, in __init__
from distributed.bokeh.scheduler import BokehScheduler
File "/home/me/anaconda3/envs/dask_test/lib/python2.7/site-packages/distributed/bokeh/scheduler.py", line 51, in <module>
with open(os.path.join(os.path.dirname(__file__), 'template.html')) as f:
IOError: [Errno 2] No such file or directory: '/home/me/anaconda3/envs/dask_test/lib/python2.7/site-packages/distributed/bokeh/template.html'
Any additional help would be great.
This was also reported here: https://github.com/dask/distributed/issues/1056
Care to chime in over there? How are you installing things?
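One way a file like template.html goes missing is an install that silently dropped package data. A hedged diagnostic sketch (the helper is illustrative, not part of distributed):

```python
import os
import tempfile

def missing_package_data(pkg_dir, files):
    # Return which expected data files are absent from an installed package
    # directory. Incomplete sdists/wheels can silently drop non-.py files
    # such as distributed/bokeh/template.html.
    return [f for f in files if not os.path.exists(os.path.join(pkg_dir, f))]

# Demonstrate against a throwaway directory standing in for site-packages:
pkg = tempfile.mkdtemp()
open(os.path.join(pkg, "scheduler.py"), "w").close()
print(missing_package_data(pkg, ["scheduler.py", "template.html"]))  # ['template.html']
```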
@mrocklin when you say
Dask does expect all of the client, scheduler, and workers, to have roughly the same software environment. Python 2 or 3 are both fine, as long as they are consistent.
Does that mean Anaconda needs to be installed on each node as well, or is a node's system Python OK (as long as it is the correct version, 2 or 3)?
I'm using the updates from https://github.com/dask/distributed/issues/1056 now and no longer get the missing template.html error, but I still see the original behavior: YARN starts an application, the application finishes, and future.result() never returns. The logs say Container exited with a non-zero exit code 1. New logs are attached, if it helps.
I'm seeing the following in your logs:
LogType:stderr
Log Upload Time:Fri May 12 13:51:15 -0700 2017
LogLength:1691
Log Contents:
Traceback (most recent call last):
File "./PYTHON_DIR/dask-c95c2b1253a379cb762c3dda4a7448a3a325491e//bin/dask-worker", line 4, in <module>
import distributed.cli.dask_worker
File "/data/sdl/hadoop/yarn/local/usercache/me/appcache/application_1491599580352_4835/container_e20_1491599580352_4835_01_000002/PYTHON_DIR/dask-c95c2b1253a379cb762c3dda4a7448a3a325491e/lib/python2.7/site-packages/distributed/__init__.py", line 3, in <module>
from .config import config
File "/data/sdl/hadoop/yarn/local/usercache/me/appcache/application_1491599580352_4835/container_e20_1491599580352_4835_01_000002/PYTHON_DIR/dask-c95c2b1253a379cb762c3dda4a7448a3a325491e/lib/python2.7/site-packages/distributed/config.py", line 48, in <module>
ensure_config_file()
File "/data/sdl/hadoop/yarn/local/usercache/me/appcache/application_1491599580352_4835/container_e20_1491599580352_4835_01_000002/PYTHON_DIR/dask-c95c2b1253a379cb762c3dda4a7448a3a325491e/lib/python2.7/site-packages/distributed/config.py", line 34, in ensure_config_file
shutil.copy(default_path, tmp)
File "/data/sdl/hadoop/yarn/local/usercache/me/appcache/application_1491599580352_4835/container_e20_1491599580352_4835_01_000002/PYTHON_DIR/dask-c95c2b1253a379cb762c3dda4a7448a3a325491e/lib/python2.7/shutil.py", line 119, in copy
copyfile(src, dst)
File "/data/sdl/hadoop/yarn/local/usercache/me/appcache/application_1491599580352_4835/container_e20_1491599580352_4835_01_000002/PYTHON_DIR/dask-c95c2b1253a379cb762c3dda4a7448a3a325491e/lib/python2.7/shutil.py", line 83, in copyfile
with open(dst, 'wb') as fdst:
IOError: [Errno 2] No such file or directory: '/home/.dask/config.yaml.tmp.12881'
I thought that we had resolved this, and indeed we had, except that Python 2 uses a slightly different exception in this case. Generalized and solved (I think) in https://github.com/dask/distributed/pull/1083
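The generalized approach amounts to catching the exception base classes that both interpreters raise, since Python 2 surfaces these failures as IOError/OSError where Python 3 raises PermissionError. A sketch under my assumptions (the function name mirrors the traceback, but the body is illustrative, not the actual code in that PR):

```python
import os
import tempfile

def ensure_config_file(destination):
    # Try to materialize a default config file, treating any filesystem
    # failure (unwritable HOME, missing parent dir) as non-fatal.
    # PermissionError only exists on Python 3; on Python 2 the same
    # failures surface as IOError/OSError, so catching the common base
    # classes covers both interpreters.
    try:
        dirname = os.path.dirname(destination)
        if not os.path.exists(dirname):
            os.makedirs(dirname)
        with open(destination, "w") as f:
            f.write("# default dask config\n")
        return True
    except (IOError, OSError):
        return False

good = os.path.join(tempfile.mkdtemp(), "config.yaml")
bad = os.path.join(os.__file__, "config.yaml")  # parent is a file: must fail
print(ensure_config_file(good))  # True
print(ensure_config_file(bad))   # False
```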
Does that mean Anaconda needs to be installed on each node as well, or is a node's system Python OK (as long as it is the correct version, 2 or 3)?
Dask does not require Anaconda. You'll need to be able to install our requirements (tornado, msgpack, etc.), but nothing in there is too exotic.
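To that point, a small hedged consistency check (the names are mine): run the same snippet on the client and on each worker node, and the reports should match.

```python
import platform
import sys

def env_report():
    # Client, scheduler, and workers should agree on at least the Python
    # major/minor version; comparing this report across nodes is a quick
    # way to spot a py27/py36 mismatch like the one in the earlier logs.
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "implementation": platform.python_implementation(),
    }

print(env_report())
```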
Please let me know if I'm doing something wrong, but the example code does not work for me. I am trying to get this to work on an 8-node cluster.
This just hangs, never returning.
Immediately after cluster.start(), looking at YARN running applications with
$ yarn application -list
shows the application for a few moments before progress hits 100% and it disappears. No matter how fast I submit the lambda to the client and ask for the result, it hangs. It seems like the application is finishing before anything can be submitted to it. I barely know what I'm doing here with dask, so please let me know if this is a usage problem on my end, or something else.