dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

Permission error while reading HDFS file #263

Open rjk-lm95 opened 4 years ago

rjk-lm95 commented 4 years ago

Hi,

I have dask-gateway set up on our Hadoop cluster with YARN.

Without establishing a connection to the gateway, a simple CSV read from HDFS works perfectly.

```python
import dask.dataframe

df = dask.dataframe.read_csv('webhdfs://namenode.example.com/path/tocsv.csv',
                             storage_options={'kerberos': True})
df.tail()
```

However, as soon as I establish the gateway cluster connection, this fails with `Exception: PermissionError()` in the worker logs.

A simple `print(df)` is able to read the dataframe structure correctly.
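
For reference, I establish the gateway connection roughly like this (the gateway address here is a placeholder):

```python
from dask_gateway import Gateway

# Placeholder address; ours points at the dask-gateway-server on the cluster
gateway = Gateway("http://gateway.example.com:8000", auth="kerberos")
cluster = gateway.new_cluster()
client = cluster.get_client()  # subsequent dask work now runs on the gateway cluster
```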

I'd appreciate any inputs on what I'm missing!

gforsyth commented 4 years ago

Hey @rjk-lm95 -- caveat emptor: I have very little experience with Hadoop clusters.

Have you created a user to run the dask-gateway-server process? If that user account doesn't have access to read from your HDFS store, then the workers also won't have those permissions.
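
One quick sanity check (just a sketch, I haven't run this; the host and path are placeholders) would be to list the file from the account running `dask-gateway-server`, using the same fsspec settings as the failing read:

```python
import fsspec

# Same webhdfs settings as the failing read_csv call; host/path are placeholders
fs = fsspec.filesystem("webhdfs", host="namenode.example.com", kerberos=True)
print(fs.ls("/path"))  # should succeed for whichever account the workers run as
```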

rjk-lm95 commented 4 years ago

Hi @gforsyth, thanks for the response. Yes, I do have a user id to start the dask-gateway-server. Based on the setup instructions for dask-gateway on YARN, I thought it would impersonate the actual user's permissions the way Apache Livy does. That may be a separate topic, though.

But for this purpose, I updated the permissions on a sample file to allow it to be read by any user. It still throws the same permission error.

What is throwing me off is that dask seems to be able to read the structure of the CSV file but fails when reading any further.
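
In case it helps with debugging, something like this should exercise the same code path directly on the workers (host and path are placeholders):

```python
import fsspec

def try_ls(path="/path"):
    # Runs on each worker, mirroring the client-side webhdfs settings
    fs = fsspec.filesystem("webhdfs", host="namenode.example.com", kerberos=True)
    return fs.ls(path)

# `client` is the gateway client from above
print(client.run(try_ls))  # worker address -> listing (or raises the PermissionError)
```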

jcrist commented 4 years ago

dask-gateway (when configured correctly) performs impersonation the same way as any other project running on Hadoop/YARN (meaning if user alice starts a dask cluster, the worker/scheduler containers will run as user alice). My guess is that this is a bug in the webhdfs component of fsspec, where it's not properly loading the delegation token from the container environment. That would also match your experience of dask being able to infer the structure but not run: tasks for setting up the graph (including reading metadata from files) are currently done client-side, i.e. locally (not on the workers).
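
One way to check that guess: YARN normally exposes the delegation token file to containers via the `HADOOP_TOKEN_FILE_LOCATION` environment variable, so something like this (a sketch, run with a client connected to your gateway cluster) would show whether the workers see it:

```python
import os

def token_location():
    # YARN containers normally receive delegation tokens via this variable
    return os.environ.get("HADOOP_TOKEN_FILE_LOCATION")

# `client` is a client connected to the gateway cluster
print(client.run(token_location))  # worker address -> token file path (or None)
```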

If you could file an issue in fsspec (and link that here) that'd be great: https://github.com/intake/filesystem_spec.

rjk-lm95 commented 4 years ago

@jcrist I've created an issue in fsspec https://github.com/intake/filesystem_spec/issues/292

rjk-lm95 commented 4 years ago

Is there another way for the workers to read data from HDFS directly into the dask processes, instead of going through the namenode via webhdfs? Since the dask cluster is already set up on the Hadoop cluster, I'm wondering if there is an alternative.

jcrist commented 4 years ago

The two options for reading data from HDFS are hdfs and webhdfs. Usually hdfs access is only possible from inside the cluster, while webhdfs is sometimes exposed outside the cluster as well. hdfs is usually more efficient, but depending on data size the difference may be negligible.

The trick here is that dask (currently) does some file access on the client side (which in your case is outside the cluster), so you may not be able to use hdfs directly. There's an open issue in dask to change this so that all IO operations are done on the workers (see https://github.com/dask/dask/issues/5380).
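
For completeness, the hdfs route would look something like this (host and port are placeholders; it requires pyarrow with libhdfs available wherever file access happens, which today includes the client):

```python
import dask.dataframe as dd

# Placeholder namenode host/port; needs pyarrow's libhdfs bindings on the
# client as well as the workers (see the client-side caveat above)
df = dd.read_csv("hdfs://namenode.example.com:8020/path/tocsv.csv")
df.tail()
```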

rjk-lm95 commented 4 years ago

@jcrist I am trying to figure out the best way to use the dask-gateway setup on YARN. I see a definite advantage to dask over Spark in terms of speed and closeness to native Python, and it scales up like a charm on our Hadoop cluster.

However, since our data is primarily on HDFS, I am trying to figure out how we can use this power of dask with the data sitting on the Hadoop cluster.

Do you have any thoughts on how we can get around this error that won't let us read data from HDFS?