dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
136 stars 40 forks source link

In HA-mode, HDFileSystem() doesn't initialize when port is not given or None is given. #132

Closed priancho closed 6 years ago

priancho commented 6 years ago

Hi,

First, thank you for a great library! It is very helpful utility to use HDFS-related functionalities in Python.

While using hdfs3 in HA-mode Hadoop, I am failing to use HA configuration with following commands.

hdfs = HDFileSystem()
# or hdfs = HDFileSystem(host="mycluster", port=None)

It fails with the following error message.

/home/test/.conda/envs/py3/lib/python3.6/site-packages/hdfs3/core.py:127: UserWarning: Setting conf parameter port failed
  warnings.warn('Setting conf parameter %s failed' % par)

While digging the cause in the source file, I found that the following codes are suspicious.

    def connect(self):
        if conf['port'] is not None:     # <---- Don't set port if it is None
            _lib.hdfsBuilderSetNameNodePort(o, conf.pop('port'))
        _lib.hdfsBuilderSetNameNode(o, ensure_bytes(conf.pop('host')))
        ...
        for par, val in conf.items():    # <---- But, this loop uses ALL items in conf including port.
            if not _lib.hdfsBuilderConfSetStr(
                    o, ensure_bytes(par), ensure_bytes(val)) == 0:
                warnings.warn('Setting conf parameter %s failed' % par)

When I added if-else statement that just skips allocation if par=="port", HDFileSystem can be properly initialized. It seems that port/user/etc. properties that are set before this loop shouldn't be used inside this loop as in the following code:

https://github.com/priancho/hdfs3/blob/bugfix_SkipAlreadyEvaluatedConnectionProperties/hdfs3/core.py#L125

        for par, val in conf.items():
            if par in ['port', 'user', 'ticket_cache', 'token']:
                continue

            if not _lib.hdfsBuilderConfSetStr(o, ensure_bytes(par),
                                              ensure_bytes(val)) == 0:
                warnings.warn('Setting conf parameter %s failed' % par)

I would like to hear any comments on this since I wonder if this is a valid fix for the problem.

Best wishes, Han-Cheol

martindurant commented 6 years ago

I think you are exactly right. Would you like to contribute that code as a PR?

martindurant commented 6 years ago

Even simpler would be

port = conf.pop('port', None)
if port is not None:
    _lib.hdfsBuilderSetNameNodePort(o, port)

(like for the other parameters).

priancho commented 6 years ago

Sure, I will make PR with the the fix (and the code simplified with your suggestion).