dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

After HDFS wire encryption enabled on secure cluster, hdfs3 is not working #146

Open pkasinathan opened 6 years ago

pkasinathan commented 6 years ago

Team,

After we enabled HDFS wire encryption on our secure cluster, the hdfs3 cat/put commands stopped working, while hdfs3 ls commands still work. In other words, any command that reads data from or writes data into HDFS fails.

Log:

>>> hdfs = HDFileSystem(host="myhadoop",pars={"dfs.nameservices": "myhadoop","dfs.ha.namenodes.myhadoop": "nn1,nn2","dfs.namenode.rpc-address.myhadoop.nn1": "nn1.example.com:8020","dfs.namenode.rpc-address.myhadoop.nn2": "nn2.example.com:8020","hadoop.security.authentication": "kerberos"})
>>> hdfs.ls("/user/prabhu/examples/pi.py");
[{'block_size': 268435456, 'size': 1412, 'permissions': 493, 'replication': 3, 'name': '/user/prabhu/examples/pi.py/', 'group': 'hdfs', 'kind': 'file', 'last_mod': 1497747805, 'owner': 'prabhu', 'last_access': 1512156876}]
>>> hdfs.cat("/user/prabhu/examples/pi.py");
2017-12-01 11:36:28.999667, p10629, th139720396728064, ERROR cannot setup block reader for Block: [block pool ID: BP-220915007-10.1.1.228-1495762903442 block ID 1074465167_724376] file /user/prabhu/examples/pi.py on Datanode: dn0055.example.com(10.1.1.77).
TcpSocket.cpp: 75: HdfsNetworkException: Read 8 bytes failed from "10.1.1.77:1019": (errno: 104) Connection reset by peer
    @   Hdfs::Internal::TcpSocketImpl::read(char*, int)
    @   Hdfs::Internal::BufferedSocketReaderImpl::readVarint32(int, int)
    @   Hdfs::Internal::BufferedSocketReaderImpl::readVarint32(int)
    @   Hdfs::Internal::RemoteBlockReader::checkResponse()
    @   Hdfs::Internal::RemoteBlockReader::RemoteBlockReader(Hdfs::Internal::ExtendedBlock const&, Hdfs::Internal::DatanodeInfo&, Hdfs::Internal::PeerCache&, long, long, Hdfs::Internal::Token const&, char const*, bool, Hdfs::Internal::SessionConfig&)
    @   Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
    @   Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
    @   Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
    @   Hdfs::Internal::InputStreamImpl::read(char*, int)
    @   hdfsRead
    @   ffi_call_unix64
    @   ffi_call
    @   _ctypes_callproc
    @   PyCFuncPtr_call
    @   PyObject_Call
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalFrameEx
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalCodeEx
    @   PyEval_EvalCode
    @   PyRun_InteractiveOneObject
    @   PyRun_InteractiveLoopFlags
    @   PyRun_AnyFileExFlags
    @   Py_Main
    @   main
    @   __libc_start_main
    @   Unknown

retry another node
2017-12-01 11:36:29.073274, p10629, th139720396728064, ERROR cannot setup block reader for Block: [block pool ID: BP-220915007-10.1.1.228-1495762903442 block ID 1074465167_724376] file /user/prabhu/examples/pi.py on Datanode: dn0010.example.com(10.1.1.20).
TcpSocket.cpp: 75: HdfsNetworkException: Read 8 bytes failed from "10.1.1.20:1019": (errno: 104) Connection reset by peer
    @   Hdfs::Internal::TcpSocketImpl::read(char*, int)
    @   Hdfs::Internal::BufferedSocketReaderImpl::readVarint32(int, int)
    @   Hdfs::Internal::BufferedSocketReaderImpl::readVarint32(int)
    @   Hdfs::Internal::RemoteBlockReader::checkResponse()
    @   Hdfs::Internal::RemoteBlockReader::RemoteBlockReader(Hdfs::Internal::ExtendedBlock const&, Hdfs::Internal::DatanodeInfo&, Hdfs::Internal::PeerCache&, long, long, Hdfs::Internal::Token const&, char const*, bool, Hdfs::Internal::SessionConfig&)
    @   Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
    @   Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
    @   Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
    @   Hdfs::Internal::InputStreamImpl::read(char*, int)
    @   hdfsRead
    @   ffi_call_unix64
    @   ffi_call
    @   _ctypes_callproc
    @   PyCFuncPtr_call
    @   PyObject_Call
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalFrameEx
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalCodeEx
    @   PyEval_EvalCode
    @   PyRun_InteractiveOneObject
    @   PyRun_InteractiveLoopFlags
    @   PyRun_AnyFileExFlags
    @   Py_Main
    @   main
    @   __libc_start_main
    @   Unknown

retry another node
2017-12-01 11:36:29.130773, p10629, th139720396728064, ERROR cannot setup block reader for Block: [block pool ID: BP-220915007-10.1.1.228-1495762903442 block ID 1074465167_724376] file /user/prabhu/examples/pi.py on Datanode: dn0028.example.com(10.1.1.38).
TcpSocket.cpp: 75: HdfsNetworkException: Read 8 bytes failed from "10.1.1.38:1019": (errno: 104) Connection reset by peer
    @   Hdfs::Internal::TcpSocketImpl::read(char*, int)
    @   Hdfs::Internal::BufferedSocketReaderImpl::readVarint32(int, int)
    @   Hdfs::Internal::BufferedSocketReaderImpl::readVarint32(int)
    @   Hdfs::Internal::RemoteBlockReader::checkResponse()
    @   Hdfs::Internal::RemoteBlockReader::RemoteBlockReader(Hdfs::Internal::ExtendedBlock const&, Hdfs::Internal::DatanodeInfo&, Hdfs::Internal::PeerCache&, long, long, Hdfs::Internal::Token const&, char const*, bool, Hdfs::Internal::SessionConfig&)
    @   Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
    @   Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
    @   Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
    @   Hdfs::Internal::InputStreamImpl::read(char*, int)
    @   hdfsRead
    @   ffi_call_unix64
    @   ffi_call
    @   _ctypes_callproc
    @   PyCFuncPtr_call
    @   PyObject_Call
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalFrameEx
    @   PyEval_EvalFrameEx
    @   _PyEval_EvalCodeWithName
    @   PyEval_EvalCodeEx
    @   PyEval_EvalCode
    @   PyRun_InteractiveOneObject
    @   PyRun_InteractiveLoopFlags
    @   PyRun_AnyFileExFlags
    @   Py_Main
    @   main
    @   __libc_start_main
    @   Unknown
.....
.....
.....
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.5/site-packages/hdfs3/core.py", line 475, in cat
    result = f.read()
  File "/opt/anaconda3/lib/python3.5/site-packages/hdfs3/core.py", line 633, in read
    out = self.read(2**16)
  File "/opt/anaconda3/lib/python3.5/site-packages/hdfs3/core.py", line 649, in read
    raise IOError('Read file %s Failed:' % self.path, -ret)
OSError: [Errno Read file /user/prabhu/examples/pi.py Failed:] 1
>>> 

Can you let me know whether hdfs3 supports wire encryption? Or am I missing some configuration?

Please let me know.

Thanks, Prabhu

martindurant commented 6 years ago

This is with "privacy" setting, correct?

This is a known deficiency in hdfs3 (specifically, in libhdfs3/libgsasl). You could try again after updating hdfs3 and libhdfs3 to the latest versions, and then downloading and conda-installing the file https://anaconda.org/mdurant/libgsasl/1.8.1/download/linux-64/libgsasl-1.8.1-1.tar.bz2 .
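
To check which protection level a cluster actually uses, the relevant properties can be read from the Hadoop client configuration. The following is a minimal sketch, not part of hdfs3: it assumes the standard Hadoop property names `hadoop.rpc.protection`, `dfs.encrypt.data.transfer` and `dfs.data.transfer.protection`, and the helper names and sample XML are hypothetical. On a real client you would pass in the contents of core-site.xml and hdfs-site.xml.

```python
import xml.etree.ElementTree as ET

# Standard Hadoop property names that control wire encryption.
WIRE_KEYS = (
    "hadoop.rpc.protection",         # core-site.xml: authentication | integrity | privacy
    "dfs.encrypt.data.transfer",     # hdfs-site.xml: true | false
    "dfs.data.transfer.protection",  # hdfs-site.xml: authentication | integrity | privacy
)


def parse_hadoop_conf(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def wire_encryption_settings(*xml_texts):
    """Merge several config documents and keep only the wire-encryption keys.

    Keys that are not set in any document map to None.
    """
    merged = {}
    for text in xml_texts:
        merged.update(parse_hadoop_conf(text))
    return {key: merged.get(key) for key in WIRE_KEYS}


# Hypothetical sample content standing in for core-site.xml and hdfs-site.xml.
SAMPLE_CORE = """<configuration>
  <property><name>hadoop.rpc.protection</name><value>privacy</value></property>
</configuration>"""
SAMPLE_HDFS = """<configuration>
  <property><name>dfs.encrypt.data.transfer</name><value>true</value></property>
</configuration>"""

settings = wire_encryption_settings(SAMPLE_CORE, SAMPLE_HDFS)
print(settings)
```

If `hadoop.rpc.protection` is `privacy` or data transfer encryption is enabled, the cluster is in the situation described in this issue.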

ayushiagarwal commented 6 years ago

Hi,

We upgraded hdfs3 and libhdfs3 to the latest versions, but put still does not write anything:

hdfs = HDFileSystem(host="myhadoop",pars={"dfs.nameservices": "myhadoop","dfs.ha.namenodes.myhadoop": "nn1,nn2","dfs.namenode.rpc-address.myhadoop.nn1": "nn1.example.com:8020","dfs.namenode.rpc-address.myhadoop.nn2": "nn2.example.com:8020","hadoop.security.authentication": "kerberos"})

hdfs.put("/user/prabhu/examples/pi.py");

It still does not work. Am I still missing any configuration?

martindurant commented 6 years ago

Unfortunately, getting all the security configurations working via libhdfs3 has proved very problematic. I now recommend that you switch to arrow's hdfs module instead. It deals much better with configuration and security, and doesn't miss much of what is provided by hdfs3.
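
For readers following this recommendation, here is a minimal sketch of reading the same file through pyarrow. It assumes a recent pyarrow that provides `pyarrow.fs.HadoopFileSystem` (older releases exposed `pyarrow.hdfs.connect` instead), and a client machine where the Java Hadoop libraries and configuration are available to libhdfs. The helper names `hdfs_kwargs` and `cat` and the ticket cache path in the usage comment are hypothetical.

```python
def hdfs_kwargs(nameservice, user=None, kerb_ticket=None):
    """Build keyword arguments for pyarrow.fs.HadoopFileSystem.

    port=0 asks libhdfs to treat `nameservice` as a logical (HA) name that
    is resolved through the Hadoop client configuration.
    """
    kwargs = {"host": nameservice, "port": 0}
    if user is not None:
        kwargs["user"] = user
    if kerb_ticket is not None:
        # Path to an existing Kerberos ticket cache.
        kwargs["kerb_ticket"] = kerb_ticket
    return kwargs


def cat(path, **kwargs):
    """Return the full contents of an HDFS file as bytes."""
    # Imported lazily so the helper above can be used without pyarrow installed.
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(**kwargs)
    with hdfs.open_input_stream(path) as stream:
        return stream.read()


# Usage on a configured client (requires a reachable cluster):
# data = cat("/user/prabhu/examples/pi.py",
#            **hdfs_kwargs("myhadoop", kerb_ticket="/tmp/krb5cc_1000"))
```

Because pyarrow goes through the Java client via libhdfs, it should pick up the cluster's own security settings from the Hadoop configuration files rather than needing them passed in from Python.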

sangramga commented 5 years ago

Is this issue resolved? I am facing the same issue while accessing encrypted (Transparent Data Encryption) HDFS with Kerberos.

martindurant commented 5 years ago

No, this is not resolved, so the recommendation to use pyarrow stands.