dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Accessing HDFS without the need of JVM #171

Closed DonDebonair closed 5 years ago

DonDebonair commented 5 years ago

Hi all,

It says in the README:

Pyarrow's JNI hdfs interface is mature and stable. It also has fewer problems with configuration and various security settings, and does not require the complex build process of libhdfs3. Therefore, all users who have trouble with hdfs3 are recommended to try pyarrow.

This means that you're ignoring an important and obvious use case: accessing HDFS without the JVM. I think this was one of the main reasons hdfs3 was created in the first place. PyArrow's hdfs functionality doesn't solve this because it requires the JVM and all Hadoop jars to be present, which is especially inconvenient if you have a Python app that you want to run inside a Docker container, and connect to an HDFS cluster from there. I have no idea how to do that with PyArrow.

What are your thoughts on that?

martindurant commented 5 years ago

You are, of course, correct. However, it turned out that libhdfs3 was very difficult to get right with regard to the myriad Hadoop security settings. Since most HDFS users seem to access it from within the cluster, especially if doing parallel work with Dask, it seemed better to let the Java stack handle that side of things.

For access from outside the cluster, I would recommend either sticking with the old hdfs3 (which should still work), or, more likely, the more straightforward webHDFS. The corresponding Python libraries don't look particularly complete, but it wouldn't take much effort to build one out, especially given the generic code in fsspec.
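To illustrate why webHDFS is attractive here: it is a plain HTTP REST API, so the standard library is enough to talk to it, with no JVM, Hadoop jars, or libhdfs3 build. A minimal sketch (the hostname, path, and user below are hypothetical placeholders; port 9870 is the Hadoop 3 default NameNode HTTP port, 50070 on Hadoop 2):

```python
# Minimal WebHDFS access using only the Python standard library.
# No JVM or Hadoop client jars are required -- just HTTP access to
# the NameNode's web port.
from urllib.parse import urlencode
from urllib.request import urlopen


def webhdfs_url(host, path, op, port=9870, **params):
    """Build a WebHDFS REST URL for the given operation (e.g. OPEN, LISTSTATUS)."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"


def read_file(host, path, user):
    """Read a file's contents via the WebHDFS OPEN operation."""
    url = webhdfs_url(host, path, "OPEN", **{"user.name": user})
    # urlopen transparently follows the NameNode's redirect to a DataNode.
    with urlopen(url) as resp:
        return resp.read()
```

Usage would look like `read_file("namenode.example.com", "/tmp/data.csv", "hdfsuser")`. For anything beyond simple reads, libraries such as fsspec's WebHDFS filesystem wrap this same REST API.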

martindurant commented 5 years ago

Alternatively, maintainers for libhdfs3 would be most welcome! It is a project which has been reborn and abandoned multiple times.

DonDebonair commented 5 years ago

Thanks for your quick reply! I will look into the solutions you mentioned and figure out the right approach for my projects. WebHDFS seems the most straightforward, especially since I'm not looking to move massive amounts of data. Maintaining libhdfs3 is sadly not something I can commit to :(

martindurant commented 5 years ago

Note that webHDFS will need to be enabled on your HDFS system, and you may need Kerberos authentication (not too likely), which requests-kerberos should handle for you.
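For reference, enabling webHDFS is a single property in `hdfs-site.xml` on the NameNode and DataNodes (it defaults to on in recent Hadoop releases, so this may already be set):

```xml
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```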

DonDebonair commented 5 years ago

We do have a Kerberized cluster, so thanks for the pointer!