dask / knit

Deprecated, please use https://github.com/jcrist/skein or https://github.com/dask/dask-yarn instead
http://knit.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

JVM fails to report back for custom env #95

Closed quartox closed 7 years ago

quartox commented 7 years ago

I pass an env with the path to my current environment to DaskYarnCluster, and I can see the zip file being built and uploaded, but then I get the error below saying that the JVM fails to report back. Passing the conda-forge channel instead starts up the environment correctly, but then loading a parquet file from HDFS fails because that environment lacks hdfs3.

My environment should be a superset of the other environment, with the same versions of dask and distributed, both from the conda-forge channel.

Exception                                 Traceback (most recent call last)
<ipython-input-12-267836867704> in <module>()
     13                                 'rm_port': resource_manager_port})
     14 client = Client(cluster)
---> 15 cluster.start(2, cpus=1, memory=500)
     16 
     17 #future = client.submit(lambda x: x + 1, 10)

/nas/isg_prodops_work/jlord/conda/envs/dask/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg/knit/dask_yarn.py in start(self, n_workers, cpus, memory, checks, **kwargs)
    127         app_id = self.knit.start(command, env=self.env,
    128                                  num_containers=n_workers, virtual_cores=cpus,
--> 129                                  memory=memory, checks=checks, **kwargs)
    130         self.app_id = app_id
    131         return app_id

/nas/isg_prodops_work/jlord/conda/envs/dask/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg/knit/core.py in start(self, cmd, num_containers, virtual_cores, memory, env, files, app_name, queue, checks)
    294  - that the cluster is otherwise unhealthy - check the RM and NN logs
    295    (use k.yarn_api.system_logs() to find these on a one-node system
--> 296 """)
    297         master_rpchost = self.client.masterRPCHost()
    298 

Exception: The application master JVM process failed to report back. This can mean:
 - that the YARN cluster cannot scheduler adequate resources - check
   k.yarn_api.cluster_metrics() and other diagnostic methods;
 - that the ApplicationMaster crashed - check the application logs, k.logs();
 - that the cluster is otherwise unhealthy - check the RM and NN logs 
   (use k.yarn_api.system_logs() to find these on a one-node system

martindurant commented 7 years ago

Do you see something like this in the RM logs?

Failing this attempt.Diagnostics: java.io.IOException: Mkdirs failed to create /tmp/hadoop-root/nm-local-dir/filecache/12_tmp/test.zip/../../../../../opt/conda/envs/test/etc/conda/deactivate.d

CondaCreator.zip_env uses relpath when writing - perhaps any files outside of the directory tree should be ignored?
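To illustrate the idea, here is a minimal sketch of how such a check might look. It is not knit's actual code: the helper name `arcname_if_inside` is hypothetical, and it works purely on path strings without resolving symlinks. It relies on the fact that `os.path.relpath` returns a path starting with `..` whenever the target lies outside the base directory.

```python
import os


def arcname_if_inside(root, path):
    """Return the archive name of `path` relative to `root`.

    Returns None when `path` lies outside `root`, so the caller can skip it.
    Hypothetical helper for illustration; symlinks are not resolved here.
    """
    rel = os.path.relpath(path, root)
    # relpath yields ".." or "../something" for anything outside the tree
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        return None
    return rel


env = "/opt/conda/envs/test"
print(arcname_if_inside(env, "/opt/conda/envs/test/bin/python"))  # inside the env
print(arcname_if_inside(env, "/opt/conda/conda-meta"))            # outside the env
```

A zipping loop could call this for each file and only write entries for which the result is not None.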

martindurant commented 7 years ago

Also, in one test I just did, I got a too-many-symbolic-links error on a terminal-like file when zip called os.stat(filename). The zipping method apparently needs some safeguards.
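One possible shape for such safeguards, as a rough sketch rather than a proposed patch: walk the tree without following directory symlinks, call `os.stat` inside a `try` block so that symlink loops or dangling links are skipped instead of aborting the whole archive, and only write regular files. The function name `zip_tree_safely` is made up for this example.

```python
import os
import stat
import zipfile


def zip_tree_safely(root, zip_path):
    """Zip regular files under `root`; return a list of (path, reason) skipped.

    Illustrative sketch only. `zip_path` should live outside `root` so the
    archive does not try to include itself.
    """
    root = os.path.abspath(root)
    skipped = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
        # followlinks=False: do not descend into symlinked directories
        for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
            for name in filenames:
                full = os.path.join(dirpath, name)
                try:
                    st = os.stat(full)  # follows links; may fail on loops or dangling links
                except OSError as exc:
                    skipped.append((full, str(exc)))
                    continue
                if not stat.S_ISREG(st.st_mode):
                    # devices, sockets, FIFOs and similar special files
                    skipped.append((full, "not a regular file"))
                    continue
                zf.write(full, os.path.relpath(full, root))
    return skipped
```

Returning the skipped paths rather than silently dropping them lets the caller log what was left out of the environment archive.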

quartox commented 7 years ago

Yes, the containers have no logs, but I see that diagnostic message:

Diagnostics: java.io.IOException: Mkdirs failed to create /hadoop06/yarn/nm/filecache/105_tmp/dask.zip/../../../../../../../conda-meta
Failing this attempt. Failing the application.
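As a quick check of what that diagnostic path means, the snippet below normalizes it and tests whether the result still lies under the localized `dask.zip` directory. This is only string manipulation on the path quoted above; treating `dask.zip` as the directory the archive is unpacked into is an assumption based on the message.

```python
import posixpath

# Path copied from the diagnostic message above
diagnostic_path = (
    "/hadoop06/yarn/nm/filecache/105_tmp/dask.zip"
    "/../../../../../../../conda-meta"
)
# Assumed unpack location of the archive on the node
localized_dir = "/hadoop06/yarn/nm/filecache/105_tmp/dask.zip"

resolved = posixpath.normpath(diagnostic_path)
# True when the normalized path is no longer under the unpack directory
escapes = not (resolved + "/").startswith(localized_dir + "/")

print("resolved path:", resolved)
print("escapes unpack dir:", escapes)
```

If the entry name escapes the unpack directory like this, it is consistent with the archive containing `..`-relative entries for files outside the environment tree.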