dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

libhdfs3 woes #138

Open nlevitt opened 6 years ago

nlevitt commented 6 years ago

Some more documentation around libhdfs3 would be helpful. It's difficult to figure out which of these is most canonical, and how they relate to each other.

https://github.com/Pivotal-Data-Attic/pivotalrd-libhdfs3
https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
https://github.com/martindurant/libhdfs3-downstream
https://github.com/ContinuumIO/libhdfs3-downstream
https://github.com/bdrosen96/libhdfs3

The readme points to pivotalrd-libhdfs3, but that one does not seem to work with this library (it is missing hdfsCreateDirectoryEx). I found that the function was added in https://github.com/martindurant/libhdfs3-downstream/commit/868cd49db7b56, which was cherry-picked from the bdrosen96 fork. So I tried building the head of https://github.com/martindurant/libhdfs3-downstream on macOS, but I ran into problems (I don't remember what exactly). I had previously had success on Linux using the package supplied by Anaconda, and I found that https://anaconda.org/conda-forge/libhdfs3/files was built "1 month and 29 days ago". So I looked for the commit on the martindurant fork that corresponded roughly to that date. Now I'm working from https://github.com/martindurant/libhdfs3-downstream/tree/7842951deab2d and I'm still getting build errors, but it feels like I'm getting close to success.

But this is crazy. It would be great if the readme could clarify and give some guidance on building or otherwise obtaining libhdfs3.
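
For anyone comparing these forks, one quick way to tell whether a given build is usable is to check that the shared library exports hdfsCreateDirectoryEx before pointing hdfs3 at it. The snippet below is only a minimal sketch using the standard ctypes module; the helper name exports_symbol is made up for illustration, and it is not how hdfs3 itself loads the library.

```python
import ctypes


def exports_symbol(library_path, symbol):
    """Return True if the shared library at library_path exports symbol.

    exports_symbol is a hypothetical helper for illustration only; it is
    not part of hdfs3 or libhdfs3.
    """
    # ctypes.CDLL raises OSError if the library cannot be loaded at all.
    lib = ctypes.CDLL(library_path)
    # ctypes looks symbols up lazily on attribute access; a missing symbol
    # raises AttributeError, which hasattr() reports as False.
    return hasattr(lib, symbol)
```

Calling exports_symbol with the path to the libhdfs3 shared library you built (wherever your build installed it) and "hdfsCreateDirectoryEx" should return False for a build that lacks the function, such as the pivotalrd-libhdfs3 one described above.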

martindurant commented 6 years ago

Yes, you are quite right that the situation around libhdfs3 is unfortunate. Let me describe the history:

Thanks for pointing out the link in the README. I have updated it. However, any help you can give in testing and/or fixing the situation would be vastly appreciated!

martindurant commented 6 years ago

As for OSX builds, this should be doable now, and making conda packages is on my to-do list. I have had local builds that worked, but they require rebuilding the full chain of C dependencies. People don't tend to use HDFS on Mac, but they should be able to.

nlevitt commented 6 years ago

Thanks, it is really useful to know that history!

I was able to build https://github.com/ContinuumIO/libhdfs3-downstream without too much trouble. First I had to build and install googletest manually though. This note in https://github.com/ContinuumIO/libhdfs3-downstream/blob/master/libhdfs3/README.md seems to be a lie? :)

To run tests, the following libraries are needed.

gtest (tested on 1.7.0)         already integrated in the source code
gmock (tested on 1.7.0)         already integrated in the source code

After that the trickiest part was convincing it to link to brew's openssl instead of the system's. The --dependency argument eventually worked.

$ git clone git@github.com:ContinuumIO/libhdfs3-downstream.git
$ cd libhdfs3-downstream/libhdfs3
$ mkdir build
$ cd build
$ ../bootstrap --prefix=/usr/local --dependency=$(brew --prefix openssl)
$ make
$ make install

martindurant commented 6 years ago

The actual build we use (for Linux) is here: https://github.com/conda-forge/libhdfs3-feedstock/tree/master/recipe. I should have pointed that out earlier. It applies a diff that removes the requirement for gtest and gmock entirely. The README is taken from HAWQ, and I think gtest and gmock are included in the larger repo that libhdfs3 is a part of, but I'm not certain; the note could just be out of date.

Note that you should also have been able to use openssl in the conda lib directory.