apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] hdfs fails to connect to HDFS 3.x cluster #25136

Open asfimport opened 4 years ago

asfimport commented 4 years ago

I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an error that looks like a protobuf or jar mismatch problem with Hadoop. The same code works on a Hadoop 2.9 cluster. I'm wondering if there is something special I need to do, or if pyarrow doesn't support Hadoop 3.x yet? Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.

```python
import pyarrow as pa

hdfs_kwargs = dict(host="namenodehost",
                   port=9000,
                   user="tgraves",
                   driver='libhdfs',
                   kerb_ticket=None,
                   extra_conf=None)
fs = pa.hdfs.connect(**hdfs_kwargs)
res = fs.exists("/user/tgraves")
```

The error that I get on Hadoop 3.x is:

```
dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
java.lang.ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)
```

Reporter: Thomas Graves

Note: This issue was originally created as ARROW-9019. Please see the migration documentation for further details.

asfimport commented 4 years ago

Andy: I think it is related to the library version. I had a similar issue, which was resolved by running `export LD_LIBRARY_PATH=/opt/anaconda/3.7.1/lib/:$LD_LIBRARY_PATH` prior to executing the code (you'll need to use your own Python distro's lib path).
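That workaround can be sketched as a short shell snippet. The Anaconda path below is the one from Andy's comment; substitute your own Python distribution's lib directory, and the script name is a placeholder:

```shell
# Prepend the Python distribution's lib directory so libhdfs and the JVM
# resolve matching shared libraries; /opt/anaconda/3.7.1/lib is illustrative.
export LD_LIBRARY_PATH=/opt/anaconda/3.7.1/lib:${LD_LIBRARY_PATH:-}

# Then launch the Python process that calls pa.hdfs.connect(), e.g.:
# python my_hdfs_script.py
```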


asfimport commented 4 years ago

Thomas Graves: can you give more details on what was missing?  I used the exact same setup and it worked with Hadoop 2.9. 

asfimport commented 4 years ago

Andy: Sorry, went back and retried and realized I had a different error related to kerberos and the 'libkrb5support.so' (we have HDP 3.1.5 with kerberos).

I'm able to do (with pyarrow 0.13.0 and Anaconda distro 3.7.1):

```python
import pyarrow as pa

fs = pa.hdfs.connect()  # <- here I use the defaults
res = fs.exists("/user/")
```

as long as I make sure I'm pointing at the correct libkrb5support library.

asfimport commented 3 years ago

Bradley Miro: Hello! I'm on the GCP Dataproc team and was wondering if there's been any progress or workarounds for this? I am attempting to support a Horovod + Dataproc integration but this keeps popping up as a blocker to finishing the integration. Any help would be appreciated :) 

asfimport commented 3 years ago

Thomas Graves: ping on this again, any information or ideas on working around or fixing?

asfimport commented 3 years ago

Bradley Miro: Hey [~tgraves],

This ended up working once I did this:


```shell
export CLASSPATH=$(hadoop classpath --glob)
```

(assuming the Hadoop binary is in your PATH)
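The same thing can be done from Python before the first HDFS call, since libhdfs reads CLASSPATH at JVM startup. This is a minimal sketch; the helper name is mine, and it assumes the `hadoop` binary is on PATH when no classpath string is passed in:

```python
import os
import subprocess

def set_hadoop_classpath(classpath=None):
    """Export CLASSPATH for libhdfs, mirroring `hadoop classpath --glob`.

    If classpath is None, shell out to the hadoop binary (assumed to be on
    PATH); otherwise use the given string. This must run before the first
    pyarrow HDFS call, because libhdfs reads CLASSPATH when the JVM starts.
    """
    if classpath is None:
        classpath = subprocess.check_output(
            ["hadoop", "classpath", "--glob"]).decode().strip()
    os.environ["CLASSPATH"] = classpath
    return classpath
```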

asfimport commented 3 years ago

Thomas Graves: [~bradmiro]  I don't really understand how that fixes the issue. The hadoop classpath is already included when a container launches on YARN; in this case I launched Spark on YARN, so the hadoop classpath should already be there. The only thing I can think of is that this changed the order of entries in the classpath.

asfimport commented 3 years ago

Thomas Graves: Note I was finally able to test this, and on Dataproc at least, setting the classpath did work around the issue. It must be a jar file ordering issue. In this case, though, I set it and manually started pyspark.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: cc @ianmcook

asfimport commented 3 years ago

Ian Cook / @ianmcook: Since this can be solved by setting CLASSPATH as described in the above comments, perhaps we should review the code in python/pyarrow/hdfs.py which automatically sets CLASSPATH to check for any faulty logic there.

FYI, according to the comments in that file, pyarrow.hdfs.connect is deprecated and pyarrow.fs.HadoopFileSystem should be used instead. I'm not sure if that has any bearing on this issue. (Update: see the related issue ARROW-13141)
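For reference, a hedged sketch of the same existence check via `pyarrow.fs.HadoopFileSystem` might look like the following. The host, port, user, and path are the placeholders from the original report, the function name is mine, and CLASSPATH still needs to be set beforehand (e.g. from `hadoop classpath --glob`):

```python
def hdfs_path_exists(path="/user/tgraves", host="namenodehost",
                     port=9000, user="tgraves"):
    # pyarrow.fs.HadoopFileSystem is the replacement for the deprecated
    # pa.hdfs.connect; like libhdfs generally, it reads CLASSPATH at JVM
    # startup, so set it before the first call.
    from pyarrow import fs
    hdfs = fs.HadoopFileSystem(host=host, port=port, user=user)
    info = hdfs.get_file_info(path)
    return info.type != fs.FileType.NotFound
```

Actually connecting requires a reachable HDFS cluster, so this is only a sketch of the API shape, not something verified against this issue's environment.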

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.