You are, of course, correct. However, it turned out that libhdfs3 was very difficult to get right with respect to the myriad Hadoop security settings. Since most HDFS users seem to access it from within the cluster, especially when doing parallel work with Dask, it seemed better to let the Java stuff handle that side of things.
For access from outside the cluster, I would recommend either sticking with the old hdfs3 (which should still work) or, more likely, the more straightforward WebHDFS. The corresponding Python libraries don't look particularly complete, but it wouldn't take much effort to build one out, especially given the generic code in fsspec.
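For reference, talking to the WebHDFS REST API directly with `requests` doesn't take much; here is a minimal sketch, where the host, port (9870 on Hadoop 3, 50070 on Hadoop 2) and user are placeholders for your own cluster:

```python
import requests

# Placeholder endpoint: WebHDFS listens on the namenode's HTTP port
# (9870 on Hadoop 3, 50070 on Hadoop 2)
BASE = "http://namenode.example.com:9870/webhdfs/v1"
USER = "someuser"  # placeholder; used for simple (non-Kerberos) auth

# List a directory
resp = requests.get(f"{BASE}/user/{USER}",
                    params={"op": "LISTSTATUS", "user.name": USER})
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])

# Read a file: the namenode answers OPEN with a redirect to a datanode,
# which requests follows automatically
resp = requests.get(f"{BASE}/user/{USER}/example.csv",
                    params={"op": "OPEN", "user.name": USER})
print(resp.content[:100])
```

An fsspec-style implementation would mostly be wrapping calls like these in the generic file-system interface.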
Alternatively, maintainers for libhdfs3 would be most welcome! It is a project that has been reborn and abandoned multiple times.
Thanks for your quick reply! I will look into the solutions you mentioned and figure out the right approach for my projects. WebHDFS seems the most straightforward, especially since I'm not looking to move massive amounts of data. Maintaining libhdfs3 is sadly not something I can commit to :(
Note that WebHDFS will need to be enabled for your HDFS system, and you may require Kerberos authentication (not too likely), which requests-kerberos should handle for you.
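For the Kerberized case, the same call would look roughly like the sketch below; it assumes you already have a valid ticket in the cache (e.g. from `kinit`), and the endpoint is again a placeholder:

```python
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Assumes a valid Kerberos ticket is already in the ticket cache
# (e.g. obtained with `kinit`); the URL is a placeholder
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
resp = requests.get(
    "http://namenode.example.com:9870/webhdfs/v1/user/someuser",
    params={"op": "LISTSTATUS"},
    auth=auth,
)
resp.raise_for_status()
print(resp.json())
```

With SPNEGO negotiation in place, the `user.name` query parameter is no longer needed; the identity comes from the ticket.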
We do have a Kerberized cluster, so thanks for the pointer!
Hi all,
It says in the README:
This means that you're ignoring an important and obvious use case: accessing HDFS without the JVM. I think this was one of the main reasons hdfs3 was created in the first place. PyArrow's hdfs functionality doesn't solve this because it requires the JVM and all the Hadoop jars to be present, which is especially inconvenient if you have a Python app that you want to run inside a Docker container and connect to an HDFS cluster from there. I have no idea how to do that with PyArrow.
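For context, this is roughly what it takes to get `pyarrow.hdfs.connect` working, as far as I can tell; every path below is a placeholder, and `hadoop classpath --glob` only works if a full Hadoop distribution is installed in the image:

```python
import os
import subprocess
import pyarrow as pa

# A JVM and a complete Hadoop install must exist inside the container;
# these locations are placeholders for illustration
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["HADOOP_HOME"] = "/opt/hadoop"
# libhdfs needs every Hadoop jar on the classpath, which is why the
# jars have to be shipped with the image
os.environ["CLASSPATH"] = subprocess.check_output(
    ["hadoop", "classpath", "--glob"]).decode().strip()

fs = pa.hdfs.connect(host="namenode", port=8020)  # fails if any of the above is missing
```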
What are your thoughts on that?