Open asfimport opened 1 year ago
Alenka Frim / @AlenkaF:
Hi [~moritzmeister]! Could you try using pyarrow directly to see if you then get the same error when opening the file? You can instantiate a HadoopFileSystem object from a URI string, or using the class constructor directly (https://arrow.apache.org/docs/dev/python/filesystems.html#hadoop-distributed-file-system-hdfs). Something similar to this:
from pyarrow import fs
hdfs = fs.HadoopFileSystem.from_uri('hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/')
hdfs.open_input_file("/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/part-00000-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")
If that works, you can then use hdfs with fsspec (https://arrow.apache.org/docs/python/filesystems.html#using-arrow-filesystems-with-fsspec) and the fsspec API to open the files (https://filesystem-spec.readthedocs.io/en/latest/api.html). Something similar to this:
from pyarrow import fs
hdfs = fs.HadoopFileSystem.from_uri('hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/')
from fsspec.implementations.arrow import ArrowFSWrapper
hdfs_fsspec = ArrowFSWrapper(hdfs)
hdfs_fsspec.open_files(...)
This way you can see whether pyarrow 10.0.0 itself works or raises an error. It is also the more direct route, so there is less that can go wrong :)
Also, do you maybe know if the Hadoop installation has changed in this time?
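To compare the two call styles (with and without the explicit namenode), it may help to separate the namenode address from the file path first. The following is only a sketch using the Python standard library; `split_hdfs_uri` is a hypothetical helper name, not part of pyarrow or fsspec, and it assumes the URI uses the `hdfs` scheme.

```python
from urllib.parse import urlsplit


def split_hdfs_uri(uri):
    """Split an hdfs:// URI into (host, port, path).

    Hypothetical helper for illustration only. host and port are None
    when the URI has no authority part, e.g. 'hdfs:///Projects/x'.
    """
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {uri!r}")
    return parts.hostname, parts.port, parts.path


if __name__ == "__main__":
    # Address taken from the snippets above; the path is shortened here.
    host, port, path = split_hdfs_uri(
        "hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/"
    )
    print(host, port, path)
```

The host and port could then be passed to the filesystem constructor, and the path alone to the open call, which keeps the connection details and the file location apart when testing each variant.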
Hey! I am trying to read a CSV file from HDFS using pyarrow together with fsspec. This used to work with pyarrow 9.0.0 and fsspec 2022.7.1; however, after I upgraded to pyarrow 10.0.0 it stopped working.
I am not quite sure whether this is an incompatibility introduced in the new pyarrow version or a bug in fsspec. So if I am in the wrong place here, please let me know.
Apart from pyarrow 10.0.0 and fsspec 2022.7.1, I am using pandas version 1.3.3 and python 3.8.11.
Here is the full stack trace
However, if I leave out the namenode IP and port, it works as expected:
Any help is appreciated, thank you!
Environment: pyarrow 10.0.0 fsspec 2022.7.1 pandas 1.3.3 python 3.8.11. Reporter: Moritz Meister
Note: This issue was originally created as ARROW-18276. Please see the migration documentation for further details.