crs4 / pydoop

A Python MapReduce and HDFS API for Hadoop
Apache License 2.0

2.0.0 recursive list fails #362

Closed dhopp97 closed 5 years ago

dhopp97 commented 5 years ago

Create a directory in hdfs:

hdfs dfs -mkdir /test

With pydoop 1.1.0 this works:

>>> import pydoop.hdfs
>>> pydoop.hdfs.ls('/test', recursive=True)
[]
>>> quit()

With pydoop 2.0.0 it fails:

>>> import pydoop.hdfs
>>> pydoop.hdfs.ls('/test', recursive=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "testenv/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 307, in ls
    dir_list = lsl(hdfs_path, user, recursive)
  File "testenv/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 292, in lsl
    if top['kind'] == 'directory':
TypeError: string indices must be integers, not str

Setting recursive=False does indeed work.

>>> pydoop.hdfs.ls('/test', recursive=False)
[]
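The TypeError in the traceback can be reproduced in isolation (a sketch, not pydoop code): in `lsl()`, `top` is expected to be a dict describing a listing entry, but in this failure it is still the plain path string, so indexing it with the key `'kind'` fails.

```python
# Minimal reproduction of the error above: indexing a str with a str key.
top = '/test'          # a plain string, not the expected dict entry
try:
    top['kind']        # raises: string indices must be integers
except TypeError as exc:
    message = str(exc)
print(message)
```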
dhopp97 commented 5 years ago

I should have mentioned that I'm still using Python 2.7 (we are working on Python 3 support). This issue stems from this code:

https://github.com/crs4/pydoop/blob/0fbd2899c001922faff10cd0d10727fbf40353e0/pydoop/hdfs/fs.py#L39

https://github.com/crs4/pydoop/blob/0fbd2899c001922faff10cd0d10727fbf40353e0/pydoop/utils/py3compat.py#L98

So in fs.py a custom basestring is imported, and in py3compat.py, when not running Python 3, basestring is set equal to unicode. This causes isinstance('/test', basestring) to evaluate to False.

Why, when using Python 2, is basestring forced to be the same as unicode? If I patch this to make basestring=basestring, am I going to break something deeper down?
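For reference, the shim logic under discussion can be sketched like this (a simplified reconstruction for illustration, not the actual py3compat.py code):

```python
import sys

# Simplified sketch of the compatibility shim: on Python 3, basestring maps
# to str; on Python 2 it was mapped to unicode only, so a plain byte str
# literal like '/test' fails the isinstance check.
if sys.version_info[0] >= 3:
    compat_basestring = str
else:
    compat_basestring = unicode  # noqa: F821 (Python 2 only)

def is_text(obj):
    """Return True only for text objects, mirroring the strict check."""
    return isinstance(obj, compat_basestring)
```

Under Python 3, `is_text('/test')` is True, but under Python 2 the same literal is a byte str and the check fails, which is what sends `lsl()` down the wrong path.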

simleo commented 5 years ago

Yes, the custom compatibility layer forces things a bit in the case of basestring. I've already pushed a patch to my fork; it's currently being tested. The main point is that hdfs methods actually expect text objects, but that can be surprising to the Python 2 user, so I'm patching it.

I'm glad to hear you're moving to Python 3, since we plan to remove support for Python 2 at some point (possibly in 2020, along with the rest of the community). In the meantime, you should be able to make your code work with the current Pydoop version by passing unicode objects:

pydoop.hdfs.ls(u'/test', recursive=True)
dhopp97 commented 5 years ago

Upgrading pydoop to 2.0.0 was one of the first steps in migrating to Python 3. I knew I could force the string to be unicode, but I didn't see it called out in the docs that text objects were expected.

Before going down the route of fixing all the places we use pydoop, I wanted to check whether this was expected and whether a simple patch to pydoop to restore the 'old' behavior under Python 2 was feasible.
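Until a patch lands, one caller-side option is a small coercion helper (`ensure_text` is a hypothetical name, not part of the pydoop API):

```python
def ensure_text(s, encoding="utf-8"):
    """Coerce a byte string to text so it passes a strict basestring check.

    Hypothetical helper for call sites; works on both Python 2 (where
    bytes is an alias for str, so plain paths get decoded to unicode)
    and Python 3 (where text paths pass through unchanged).
    """
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s
```

Call sites would then read `pydoop.hdfs.ls(ensure_text('/test'), recursive=True)`, avoiding a `u''` prefix at every usage.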