rjk-lm95 opened this issue 4 years ago
Hey @rjk-lm95 -- caveat emptor: I have very little experience with hadoop clusters.
Have you created a user to run the dask-gateway-server process? If that user account doesn't have access to read from your HDFS store, then the workers also won't have those permissions.
Hi @gforsyth , thanks for the response. Yes, I do have a user ID to start the dask-gateway-server. Based on the setup instructions for dask-gateway on YARN, I thought it would impersonate the actual user's permissions like Apache Livy does. That may be a separate topic.
But for this purpose, I updated the permission for a sample file to allow it to be read by any user. It still throws the same permission error.
What is throwing me off is that dask seems to be able to read the structure of the csv file but fails when reading further.
dask-gateway (when configured correctly) does impersonation the same as any other project running on Hadoop/YARN (meaning if user alice starts a dask cluster, the worker/scheduler containers will run as user alice). My guess is that this is a bug in the webhdfs component of fsspec, where it's not properly loading the delegation token from the container environment. That would also match your experience of dask being able to infer the structure but not run: tasks for setting up the graph (including reading metadata from files) are currently done on the client side, i.e. locally rather than on the workers.
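To make the client-side/worker-side split concrete, here is a hedged sketch of the pattern under discussion. The gateway address, namenode host, and file path are all hypothetical, and the imports are deferred inside the function so the snippet stays inert without a real deployment:

```python
def read_csv_via_gateway():
    """Sketch: metadata inference runs on the client; partition reads
    run on the workers, where a missing delegation token can surface
    as a PermissionError."""
    # Deferred imports: this function only makes sense against a real cluster.
    import dask.dataframe as dd
    from dask_gateway import Gateway

    # Hypothetical addresses -- substitute your deployment's values.
    gateway = Gateway("http://gateway.example.com:8000")
    cluster = gateway.new_cluster()
    client = cluster.get_client()

    # Header inspection (column names, dtype inference) happens here,
    # on the client -- which is why printing the structure succeeds.
    df = dd.read_csv("webhdfs://namenode.example.com:50070/data/sample.csv")

    # Computing dispatches the actual file reads to the workers,
    # which is where the PermissionError is raised in this issue.
    return df.head()
```

This matches the behavior reported above: the dataframe structure prints fine (client-side), while any real computation fails (worker-side).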
If you could file an issue in fsspec (and link it here), that'd be great: https://github.com/intake/filesystem_spec.
@jcrist I've created an issue in fsspec https://github.com/intake/filesystem_spec/issues/292
Is there another way to read the data from HDFS directly into the dask process, instead of the workers going via webhdfs through the namenode? Since the dask cluster is already set up on the Hadoop cluster, is there an alternative?
The two options for reading data from HDFS are hdfs and webhdfs. hdfs access can usually only happen from inside the cluster, while webhdfs is sometimes exposed outside the cluster as well. hdfs is usually more efficient, though depending on data size the difference may be negligible.
The trick here is that dask (currently) does some file access on the client side (which in your case is outside the cluster), so you may not be able to use hdfs directly. There's an open issue in dask to change this so that all IO operations are done on the workers (see https://github.com/dask/dask/issues/5380).
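The choice between the two access paths described above comes down to the URL scheme. A small sketch showing the two forms side by side (the namenode host is hypothetical, and the ports shown are common defaults, not guaranteed for any given cluster):

```python
from urllib.parse import urlparse

# Two URLs for the same file (hypothetical namenode host).
# hdfs://    -- native RPC; generally only reachable from inside the cluster.
# webhdfs:// -- HTTP REST API via the namenode; sometimes exposed externally.
native_url = "hdfs://namenode.example.com:8020/data/sample.csv"
rest_url = "webhdfs://namenode.example.com:50070/data/sample.csv"

for url in (native_url, rest_url):
    parts = urlparse(url)
    print(parts.scheme, parts.netloc, parts.path)
```

fsspec dispatches on that scheme, so switching access paths in dask is a matter of changing the URL prefix, provided the chosen endpoint is reachable from wherever the read actually executes (client or worker).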
@jcrist I am trying to figure out the best way to use the dask-gateway setup on YARN. I see a definite advantage to dask over Spark in speed and closeness to native Python, and it scales up like a charm on our Hadoop cluster.
However, since our data is primarily on HDFS, I am trying to figure out how we can use this power of dask with the data sitting on the Hadoop cluster.
Do you have any thoughts on how we can get around this error that won't let us read data from HDFS?
Hi,
I have dask-gateway set up on our Hadoop cluster with YARN.
Without establishing a connection to the gateway, a simple CSV read from HDFS works perfectly.
However, as soon as I establish the gateway cluster connection, this fails with Exception: PermissionError() in the worker log.
A simple print(df) still shows the dataframe structure correctly.
I'd appreciate any inputs on what I'm missing!