dask / knit

Deprecated, please use https://github.com/jcrist/skein or https://github.com/dask/dask-yarn instead
http://knit.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Cleanup configuration handling #93

Closed: jcrist closed this 6 years ago

jcrist commented 6 years ago

This spiraled out from finding a simple bug in the configuration inference (a misuse of str.strip). I think the logic is identical to before, but bugs might have slipped in; lacking the environment setup to perform rigorous tests, this is a bit hard to ensure.
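
For context, the classic str.strip pitfall looks like this (a minimal illustration with made-up values, not the actual knit code):

```python
url = "hdfs://somehost:8020"

# Buggy: strip("hdfs://") removes any of the *characters* {h, d, f, s, :, /}
# from both ends, so it also eats the leading 's' of the hostname.
print(url.strip("hdfs://"))   # -> 'omehost:8020'

# Correct: check for and slice off the prefix explicitly
prefix = "hdfs://"
if url.startswith(prefix):
    print(url[len(prefix):])  # -> 'somehost:8020'
```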

martindurant commented 6 years ago

You could use the docker image mdurant/hadoop for testing, which has HDFS and YARN running.

A couple of thoughts (quoted and answered in the reply below):

jcrist commented 6 years ago

> the code in hdfs3.conf is essentially the same as this was; should it be changed, or could both be factored out?

Hmmm, the bug exists there then as well. Since the namenode/port stuff is only needed for HDFS, most of this could probably be pushed to hdfs3 instead.
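
For reference, the kind of namenode/port inference under discussion looks roughly like this (a sketch only; the function name, defaults, and lookup order are assumptions, not knit's actual code):

```python
import os
import re
from xml.etree import ElementTree

def guess_namenode(conf_dir=None):
    """Infer (host, port) from fs.defaultFS in core-site.xml."""
    conf_dir = conf_dir or os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")
    tree = ElementTree.parse(os.path.join(conf_dir, "core-site.xml"))
    for prop in tree.getroot().iter("property"):
        if prop.findtext("name") == "fs.defaultFS":
            value = (prop.findtext("value") or "").strip()
            m = re.match(r"hdfs://([^:/]+)(?::(\d+))?", value)
            if m:
                # fall back to the conventional namenode port if none given
                return m.group(1), int(m.group(2) or 8020)
    return None, None
```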

> a Knit object tries to create an HDFileSystem instance with the same conf; this could be much simpler if that didn't happen (we could pass an hdfs instance in, for example)

Why can't you rely on creating an HDFileSystem internally, and let its init find the conf file? AFAICT none of the yarn code relies on config stuff found in hdfs-site.xml.
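
If hdfs3's autodetection is relied on, the idea would be something like this (a sketch; exactly how hdfs3 discovers configuration, e.g. via HADOOP_CONF_DIR, is its own concern):

```python
from hdfs3 import HDFileSystem

# With no explicit host/port, hdfs3 falls back to whatever configuration it
# can discover itself, so knit would not need to duplicate that inference.
hdfs = HDFileSystem()
```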

> should Knit depend on hdfs3?

Since knit relies on HDFS support for working with Python environments, I'd say probably?

> Can we solve hdfs security problems by making a hadoop delegation token via yarn, which is already done in the scala code?

I don't know enough about this stuff to answer that :).

martindurant commented 6 years ago

Indeed, the duplication with hdfs3 seems like a bad idea. The only reasons in favour are:

- it's probably a good idea to have both find the same defaults directory when auto-configuring, in cases where there might be multiple sets of config files lying around;
- if overriding config values for hdfs, it's a pain to have to do it in multiple places.

Consider the alternative: Knit or DaskYarnCluster takes an hdfs= parameter that the user has already configured (see the sketch below).
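
A sketch of that alternative (the hdfs= keyword is the proposal here, not an existing parameter; the host and port values are illustrative):

```python
from hdfs3 import HDFileSystem
from knit import Knit

# The user configures HDFS access once...
hdfs = HDFileSystem(host="namenode", port=8020)

# ...and knit reuses that instance instead of re-inferring the same config.
k = Knit(hdfs=hdfs)
```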

Note that at the moment hdfs3 is strictly optional; its only uses are to list environments stored for the user and to decide whether uploading a new .zip is required. The actual upload is done in Scala right now. It would be good to use hdfs3 to clean the .knitDeps directory and to access logs directly (instead of via the YARN CLI). Until "privacy" security is firmly fixed in libhdfs3, it is conceivable that in some circumstances YARN can upload files but hdfs3 cannot.
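
Those hdfs3-side tasks might look roughly like this (the paths are illustrative, and whether hdfs3 has permission for each call depends on the security situation described above):

```python
from hdfs3 import HDFileSystem

hdfs = HDFileSystem()  # assumes discoverable configuration, as above
deps_dir = "/user/someuser/.knitDeps"  # illustrative path

# list environments already stored for the user
existing = hdfs.ls(deps_dir)

# decide whether uploading a new .zip is required
if not hdfs.exists(deps_dir + "/env.zip"):
    pass  # upload, currently handled on the Scala side

# clean out the directory when it is stale
hdfs.rm(deps_dir, recursive=True)
```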