
[C++/Python] Add troubleshooting section for setting up HDFS JNI interface #17331

Open · asfimport opened this issue 6 years ago

asfimport commented 6 years ago

The hadoop library directory contains a libhdfs.a and a libhadoop.so but no libhdfs.so.

Environment: linux trusty-cdh5
Reporter: Martin Durant / @martindurant

Note: This issue was originally created as ARROW-1313. Please see the migration documentation for further details.
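For context, Arrow's HDFS bridge loads libhdfs.so dynamically at runtime rather than linking against it, and it consults the ARROW_LIBHDFS_DIR environment variable when locating the library. A minimal sketch of the usual workaround, assuming a pyarrow build with HDFS support; the library path and the namenode host/port are placeholders for this environment:

```python
import os

# Arrow looks for libhdfs.so in ARROW_LIBHDFS_DIR before falling back to
# the Hadoop-distribution defaults. Set it before creating the filesystem.
# The directory below is an assumption for this particular setup.
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/lib"

import pyarrow.fs

# libhdfs is a JNI wrapper, so a JVM and the Hadoop jars must also be
# discoverable (JAVA_HOME / CLASSPATH) for this call to succeed.
hdfs = pyarrow.fs.HadoopFileSystem(host="localhost", port=8020)  # placeholders
print(hdfs.get_file_info(pyarrow.fs.FileSelector("/")))
```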

asfimport commented 6 years ago

Wes McKinney / @wesm: Can you provide more detail about your environment (i.e. so that this can be reproduced)? The location of libhdfs.so can vary a lot by Hadoop distribution.

asfimport commented 6 years ago

Martin Durant / @martindurant: Docker file: https://github.com/dask/hdfs3/blob/master/continuous_integration/Dockerfile

This uses an official .deb of CDH5, installed into /usr/lib/hadoop. There is no libhdfs.so anywhere in that directory.

Using java-7-openjdk-amd64.

asfimport commented 6 years ago

Wes McKinney / @wesm: It appears that in this particular Hadoop distribution, libhdfs is packaged as a separate Linux package:

apt-get install libhdfs0-dev

and then

# find /usr -name \*.so -print | grep hdfs
/usr/lib/libhdfs.so
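
Since the library's location varies by distribution, the same search can be scripted and fed to Arrow directly. A small standard-library sketch; the search roots are assumptions for a typical Linux layout and should be extended for other distributions:

```python
import os

# Directories where Hadoop distributions commonly place native libraries.
# These roots are assumptions; adjust for your distribution.
SEARCH_ROOTS = ["/usr/lib", "/usr/lib/hadoop", "/opt",
                os.environ.get("HADOOP_HOME", "")]

def find_libhdfs():
    """Walk the candidate roots and return the first directory holding libhdfs.so."""
    for root in filter(None, SEARCH_ROOTS):
        for dirpath, _dirnames, filenames in os.walk(root):
            if "libhdfs.so" in filenames:
                return dirpath
    return None

libdir = find_libhdfs()
if libdir:
    # Point Arrow at the directory we found.
    os.environ["ARROW_LIBHDFS_DIR"] = libdir
    print("libhdfs.so found in", libdir)
else:
    print("libhdfs.so not found; install it (e.g. libhdfs0-dev, as above)")
```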
asfimport commented 6 years ago

Martin Durant / @martindurant: That would install the whole of Hadoop as system packages, so there would be two separate installations alongside the CDH install from before. libhdfs.so is only 200 kB; can it not be distributed?

asfimport commented 6 years ago

Wes McKinney / @wesm: My understanding is that the safest thing to do in production is to use the libhdfs.so shipped with a particular Hadoop distribution, since there may be internal details particular to that version of Hadoop; while the public C API is the same between versions, in theory there could be internal details in the JNI implementation that break the Java "ABI". The Hadoop community would be able to give better advice.
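
One practical consequence of relying on the distribution's own libhdfs.so is that the JNI side also needs the matching Hadoop jars on the CLASSPATH when the JVM starts. A sketch of the usual setup, assuming the hadoop CLI from the same distribution is on PATH; the JAVA_HOME path is an assumption based on the environment described above:

```python
import os
import subprocess

# libhdfs starts a JVM internally and resolves Hadoop classes via CLASSPATH.
# Build the classpath from the same distribution that provided libhdfs.so,
# so the JNI internals and the Java classes stay in sync.
classpath = subprocess.check_output(
    ["hadoop", "classpath", "--glob"], text=True
).strip()
os.environ["CLASSPATH"] = classpath

# JAVA_HOME should point at the JVM the distribution targets
# (java-7-openjdk-amd64 in the environment above); path is an assumption.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-7-openjdk-amd64")
```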