dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Bug? The lib skip directory contents #161

Open timoninmaxim opened 6 years ago

timoninmaxim commented 6 years ago

Hello, I am trying to find the reason for this behaviour. It looks like a bug.

hdfs3 was installed from conda-forge.

So with simple hadoop ls commands:

$ hadoop fs -ls / | grep smartdata
drwxr-xr-x - hdfs supergroup 0 2018-02-16 11:39 /smartdata
$ hadoop fs -ls /smartdata
drwxrwxrwx - hdfs supergroup 0 2018-02-02 11:35 /smartdata/hive
...
$ hadoop fs -ls /smartdata/hive
drwxrwxrwx - hdfs supergroup 0 2018-04-26 09:32 /smartdata/hive/external
...

And this is with the hdfs3 library. It returns an empty listing for /smartdata, but lists /smartdata/hive correctly.

$ python -c 'from hdfs3 import HDFileSystem; print HDFileSystem().ls("/")'
[u'//smartdata', ...]
$ python -c 'from hdfs3 import HDFileSystem; print HDFileSystem().ls("/smartdata")'
[]
$ python -c 'from hdfs3 import HDFileSystem; print HDFileSystem().ls("/smartdata/hive")'
[u'/smartdata/hive/external', u'/smartdata/hive/hql', ...]

I saw a similar issue here, but this does not seem to be a permission issue: https://stackoverflow.com/questions/40405527/python-hdfs3-fails-to-list-non-owned-files
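The manual comparison above (hadoop CLI listing versus hdfs3 listing) could be automated with something like the following. This is only a rough sketch: the helper names `parse_hadoop_ls`, `normalize` and `missing_from_hdfs3` are made up for illustration and are not part of hdfs3. It assumes the usual eight-column `hadoop fs -ls` output shown above, and collapses repeated slashes so that an entry like `//smartdata` compares equal to `/smartdata`.

```python
import re


def parse_hadoop_ls(output):
    """Extract the path column from text printed by `hadoop fs -ls`."""
    paths = []
    for line in output.splitlines():
        fields = line.split(None, 7)
        # Entry lines have 8 columns and start with a permission string
        # such as 'drwxr-xr-x'; header lines and '...' are skipped.
        if len(fields) == 8 and fields[0][0] in "d-":
            paths.append(fields[7])
    return paths


def normalize(path):
    # Collapse repeated slashes, e.g. '//smartdata' -> '/smartdata'.
    return re.sub(r"/+", "/", path)


def missing_from_hdfs3(cli_output, hdfs3_listing):
    """Return paths the CLI reported that the hdfs3 listing lacks."""
    cli = set(normalize(p) for p in parse_hadoop_ls(cli_output))
    lib = set(normalize(p) for p in hdfs3_listing)
    return sorted(cli - lib)
```

In practice `cli_output` would be the captured text of `hadoop fs -ls /smartdata` and `hdfs3_listing` the result of `HDFileSystem().ls("/smartdata")`; a non-empty result means hdfs3 is hiding entries that the CLI can see.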

martindurant commented 6 years ago

Hm, honestly I have no idea what might be going on here. Does walk fail to descend into the directories?

timoninmaxim commented 6 years ago

Hello, I have found the bug. hdfs3 does not work correctly with ACLs.

$ python -c 'from hdfs3 import HDFileSystem; print HDFileSystem().ls("/user")'
[u'/user/cloudera', u'/user/history', u'/user/hive', u'/user/hue', u'/user/jenkins', u'/user/oozie', u'/user/root', u'/user/spark']
$ hadoop fs -setfacl -m user:jenkins:rwx /user/cloudera
$ python -c 'from hdfs3 import HDFileSystem; print HDFileSystem().ls("/user")'
[]
$ python -c 'from hdfs3 import HDFileSystem; print HDFileSystem().ls("/user/cloudera")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/miniconda/lib/python2.7/site-packages/hdfs3/core.py", line 380, in ls
    raise FileNotFoundError(path)
IOError: /user/cloudera
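To check which directories are affected, one could probe a list of known paths (for example taken from the hadoop CLI) and record how `ls` responds to each. The sketch below is an assumption-laden illustration: `probe_listings` is a hypothetical helper, and it only relies on the filesystem object having an `ls(path)` method that raises `IOError` on failure, as in the traceback above.

```python
def probe_listings(fs, paths):
    """Classify each path by how fs.ls() responds to it."""
    report = {}
    for path in paths:
        try:
            entries = fs.ls(path)
        except IOError as exc:
            # The traceback above shows ls() surfacing an IOError.
            report[path] = "error: %s" % exc
        else:
            # An empty list is ambiguous: the directory may really be
            # empty, or the listing may have been silently dropped.
            report[path] = "ok" if entries else "empty"
    return report
```

Calling `probe_listings(HDFileSystem(), ["/user", "/user/cloudera"])` before and after the `setfacl` command should show the listing for /user changing from "ok" to "empty", assuming the behaviour reported here.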

timoninmaxim commented 6 years ago

I found a similar issue in libhdfs3-downstream: https://github.com/ContinuumIO/libhdfs3-downstream/issues/4

It is fixed on the master branch, but conda-forge uses the concat branch when installing libhdfs3: https://github.com/conda-forge/libhdfs3-feedstock/blob/47aec11797cf29907738a02941690ef81de2fcfd/recipe/build.sh#L3

martindurant commented 6 years ago

Yes, I agree that libhdfs3 should be rereleased; I can try to get to that soon.