apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] FileInfo.path includes port number #44074

Open wmkai opened 1 month ago

wmkai commented 1 month ago

Describe the usage question you have. Please include as many useful details as possible.

When I create hdfs = fs.HadoopFileSystem(host="xxx") (with from pyarrow import fs) and then run

selector = fs.FileSelector(
    base_dir="/home/xxx",
    allow_not_found=False,
    recursive=False
)
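file_infos = hdfs.get_file_info(selector)
# get_file_info() returns a list of FileInfo objects when given a selector
for file_info in file_infos:
    print(file_info.path)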

it prints :8020/home/xxx/

But when I run the same code on another machine, it prints /home/xxx/pretrained_models, without the :8020 prefix.

I checked the Python and pyarrow versions on both machines, and they are the same. Why does the path include the port number, and how can I get it to not include it?

Component(s)

Python

pitrou commented 1 month ago

The HadoopFileSystem finds and loads a third-party library on your system to actually access the HDFS filesystem. Depending on your system, the library is called hdfs.dll (Windows), libhdfs.so (Linux) or libhdfs.dylib (macOS).

Can you find where that library is located, how it was installed and what its version is?
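
One way to start answering that is to list the places where the library would typically be expected. The sketch below is only a rough aid, not Arrow's actual lookup logic: it assumes the library is found through the ARROW_LIBHDFS_DIR environment variable or under $HADOOP_HOME/lib/native, and the helper names libhdfs_filename and candidate_libhdfs_paths are hypothetical. The real loader may check other locations as well.

```python
import os
import sys
from pathlib import Path


def libhdfs_filename(platform: str = sys.platform) -> str:
    """Return the library file name mentioned above for the given platform."""
    if platform.startswith("win"):
        return "hdfs.dll"
    if platform == "darwin":
        return "libhdfs.dylib"
    return "libhdfs.so"


def candidate_libhdfs_paths(env=None, platform: str = sys.platform):
    """List likely library locations (assumed search order, not Arrow's own)."""
    env = os.environ if env is None else env
    name = libhdfs_filename(platform)
    candidates = []
    # Assumption: an explicit directory override takes priority.
    if env.get("ARROW_LIBHDFS_DIR"):
        candidates.append(Path(env["ARROW_LIBHDFS_DIR"]) / name)
    # Assumption: otherwise the library ships with the Hadoop installation.
    if env.get("HADOOP_HOME"):
        candidates.append(Path(env["HADOOP_HOME"]) / "lib" / "native" / name)
    return candidates
```

Running `for p in candidate_libhdfs_paths(): print(p, p.exists())` on both machines and comparing the output should show whether they pick up the library from different places.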

Also, is :8020 part of your HDFS URL?
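
Until the cause is identified, a possible stopgap is to normalize the returned paths on the client side. This is a minimal sketch of a workaround, not a fix for the underlying behavior: strip_port_prefix is a hypothetical helper, and it assumes the unwanted prefix always has the form of a colon followed by digits directly before the leading slash, as in the :8020/home/xxx/ output reported above.

```python
import re

# Matches a leading ":<digits>" only when it is immediately followed by "/".
_PORT_PREFIX = re.compile(r"^:\d+(?=/)")


def strip_port_prefix(path: str) -> str:
    """Remove a leading ':<port>' from a path; other paths are returned unchanged."""
    return _PORT_PREFIX.sub("", path, count=1)
```

It could be applied as `strip_port_prefix(file_info.path)` inside the loop over the results, so that both machines yield the same paths.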