In [1]: import pyarrow.fs
In [2]: c = pyarrow.fs.HadoopFileSystem()
In [3]: sel = pyarrow.fs.FileSelector('/user/rwiumli')
In [4]: c.get_target_stats(sel)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-4-f92157e01e47> in <module>
----> 1 c.get_target_stats(sel)
~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()
~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: HDFS list directory failed, errno: 2 (No such file or directory)
In [5]: sel = pyarrow.fs.FileSelector('.')
In [6]: c.get_target_stats(sel)
Out[6]:
[<FileStats for 'sandeep': type=FileType.Directory>,
<FileStats for 'venv': type=FileType.Directory>,
<FileStats for 'sample.py': type=FileType.File, size=506>]
In [7]: !ls
sample.py sandeep venv
It looks like the new hadoop fs interface is doing a local lookup?
Ok fine...
In [8]: sel = pyarrow.fs.FileSelector('hdfs:///user/rwiumli') # shouldn't have to do this
In [9]: c.get_target_stats(sel)
hdfsGetPathInfo(hdfs:///user/rwiumli): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:593)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
hdfsListDirectory(hdfs:///user/rwiumli): FileSystem#listStatus error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:410)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1609)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:667)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-9-f92157e01e47> in <module>
----> 1 c.get_target_stats(sel)
~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()
~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: HDFS list directory failed, errno: 22 (Invalid argument)
and here's the rub
In [10]: c = pyarrow.hdfs.HadoopFileSystem()
20/03/27 09:16:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
In [11]: c.ls('/user/rwiumli')
Out[11]:
['hdfs://nameservice/user/rwiumli/.Trash',
'hdfs://nameservice/user/rwiumli/.sparkStaging',
'hdfs://nameservice/user/rwiumli/.staging',
'hdfs://nameservice/user/rwiumli/acceptance',
'hdfs://nameservice/user/rwiumli/copy_test',
'hdfs://nameservice/user/rwiumli/hive-site.xml',
'hdfs://nameservice/user/rwiumli/mli',
'hdfs://nameservice/user/rwiumli/model_63702762843888.txt',
'hdfs://nameservice/user/rwiumli/oozie-oozi',
'hdfs://nameservice/user/rwiumli/sqoop',
'hdfs://nameservice/user/rwiumli/test',
'hdfs://nameservice/user/rwiumli/test_all.yml',
'hdfs://nameservice/user/rwiumli/user']
Antoine Pitrou / @pitrou:
I can't reproduce with PyArrow 4.0.0. [~yalwan-iqvia] If you still encounter this problem on the latest PyArrow version, feel free to ping.
I'll preface this with the limited setup I had to do:
export CLASSPATH=$(hadoop classpath --glob)
export ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64
Then I ran the session shown above.
Finally, system info:
Reporter: Yaqub Alwan
Note: This issue was originally created as ARROW-8240. Please see the migration documentation for further details.