fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

Permission error reading HDFS in a dask cluster environment #292

Open rjk-lm95 opened 4 years ago

rjk-lm95 commented 4 years ago

Hi,

I have a dask-gateway cluster setup on YARN and am trying to read a CSV file on HDFS.

Everything works fine when executing locally on the client:

    import dask.dataframe as dd

    df = dd.read_csv('webhdfs://namenode.example.com/path/csvfile.csv', storage_options={'kerberos': True})
    df.tail()

However, when I start a dask cluster and run the same code, so that the read executes on a dask worker, it raises a PermissionError.

It appears that the delegation token isn't being loaded correctly from the containing environment.
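One way to check that suspicion is to compare the auth-related environment on the client with the one on each worker. The snippet below is only a rough sketch: the helper name `describe_auth_env` is made up for illustration, and the choice of `HADOOP_TOKEN_FILE_LOCATION` and `KRB5CCNAME` as the variables worth inspecting is an assumption about how this cluster passes credentials, not something confirmed here.

```python
import os

# Environment variables commonly involved in Hadoop/Kerberos auth.
# Assumption: these are the relevant ones on this particular cluster.
AUTH_ENV_VARS = ("HADOOP_TOKEN_FILE_LOCATION", "KRB5CCNAME")


def describe_auth_env(environ=None):
    """Report which auth variables are set and whether the files they name exist.

    Hypothetical helper for illustration; not part of fsspec or dask.
    """
    environ = os.environ if environ is None else environ
    report = {}
    for name in AUTH_ENV_VARS:
        value = environ.get(name)
        if value is None:
            report[name] = "unset"
            continue
        if value.startswith("FILE:"):
            # Kerberos cache names may carry a "FILE:" type prefix.
            path = value[len("FILE:"):]
        elif ":" in value:
            # Other cache types (e.g. keyring-based) are not file paths.
            report[name] = "set (not a plain file path)"
            continue
        else:
            path = value
        report[name] = "file present" if os.path.exists(path) else "set, file missing"
    return report


if __name__ == "__main__":
    # Local (client-side) view of the environment.
    print(describe_auth_env())
```

Given a `distributed.Client` connected to the cluster, `client.run(describe_auth_env)` should return the same report for each worker, keyed by worker address, which can then be compared against the client-side output.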

I'd appreciate any inputs or thoughts on this.

martindurant commented 4 years ago

Is the permission error happening in your local environment or within one of the workers?

rjk-lm95 commented 4 years ago

It is occurring on the worker. I see a PermissionError() in the worker log.

martindurant commented 4 years ago

@jcrist, any thoughts on Kerberos in YARN for HTTP/SPNEGO? @rjk-lm95, any reason you don't use HDFS, as opposed to webHDFS? Is this because your client is outside the cluster?

rjk-lm95 commented 4 years ago

@martindurant That is correct. The client is outside the Hadoop cluster, so I'm trying to use webhdfs.