apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] pyarrow dataset API fails to read s3 directory #31812

Open asfimport opened 2 years ago

asfimport commented 2 years ago

When an s3fs filesystem instance is passed as `filesystem` to the `pyarrow.dataset.dataset` API and the source is a directory path inside a bucket, an error is raised:

```python
In [5]: from fsspec.core import get_fs_token_paths

In [6]: fs, _, path = get_fs_token_paths("s3://prem-rapids-test/folder/", mode="rb")

In [7]: fs
Out[7]: <s3fs.core.S3FileSystem at 0x7f3d02cc1460>

In [8]: path
Out[8]: ['prem-rapids-test/folder']

In [10]: pa.dataset.dataset(path, filesystem=fs, format="parquet")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 pa.dataset.dataset(path, filesystem=fs, format="parquet")

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/dataset.py:670, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    668 elif isinstance(source, (tuple, list)):
    669     if all(_is_path_like(elem) for elem in source):
--> 670         return _filesystem_dataset(source, **kwargs)
    671     elif all(isinstance(elem, Dataset) for elem in source):
    672         return _union_dataset(source, **kwargs)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/dataset.py:422, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    414 options = FileSystemFactoryOptions(
    415     partitioning=partitioning,
    416     partition_base_dir=partition_base_dir,
    417     exclude_invalid_files=exclude_invalid_files,
    418     selector_ignore_prefixes=selector_ignore_prefixes
    419 )
    420 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 422 return factory.finish(schema)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/_dataset.pyx:1680, in pyarrow._dataset.DatasetFactory.finish()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/_fs.pyx:1179, in pyarrow._fs._cb_open_input_file()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pyarrow/fs.py:394, in FSSpecHandler.open_input_file(self, path)
    391 from pyarrow import PythonFile
    393 if not self.fs.isfile(path):
--> 394     raise FileNotFoundError(path)
    396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: prem-rapids-test/folder
``` 

It does work, however, when the folder is passed as a single full URI string:
```python

In [3]: import pyarrow.dataset

In [4]: pa.dataset.dataset("s3://prem-rapids-test/folder/", format="parquet")
Out[4]: <pyarrow._dataset.FileSystemDataset at 0x7f3ce502d870>
```
Reporter: Prem Sagar Gali / @galipremsagar

Note: This issue was originally created as ARROW-16438. Please see the migration documentation for further details.

asfimport commented 2 years ago

Prem Sagar Gali / @galipremsagar: From the stack trace this issue seems similar to https://issues.apache.org/jira/browse/ARROW-10923, but that issue doesn't include a reproducer.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Arrow has its own S3 filesystem that gets used when you pass the URI as a string, but if you pass an fsspec filesystem instance, a compatibility layer is used instead, and it might have bugs (and/or fsspec changed its semantics slightly).

See https://arrow.apache.org/docs/python/filesystems.html

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: cc @jorisvandenbossche for the FSSpec compatibility issue.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: It might be that this is not related to passing a native vs fsspec filesystem, but simply that when you pass a list of strings, we assume it is a list of files, not directories.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: So reproducing this locally with a pyarrow local filesystem:


```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
from pyarrow.fs import LocalFileSystem

table = pa.table({'a': [1, 2, 3]})
pq.write_to_dataset(table, "test_parquet_dataset/")

In [9]: ds.dataset(["test_parquet_dataset/"], format="parquet", filesystem=LocalFileSystem())
---------------------------------------------------------------------------
IsADirectoryError                         Traceback (most recent call last)
<ipython-input-9-8e554a28b381> in <module>
----> 1 ds.dataset(["test_parquet_dataset/"], format="parquet", filesystem=LocalFileSystem())

~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    695     elif isinstance(source, (tuple, list)):
    696         if all(_is_path_like(elem) for elem in source):
--> 697             return _filesystem_dataset(source, **kwargs)
    698         elif all(isinstance(elem, Dataset) for elem in source):
    699             return _union_dataset(source, **kwargs)

~/scipy/repos/arrow/python/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    435 
    436     if isinstance(source, (list, tuple)):
--> 437         fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
    438     else:
    439         fs, paths_or_selector = _ensure_single_source(source, filesystem)

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_multiple_sources(paths, filesystem)
    356                 raise FileNotFoundError(info.path)
    357             elif file_type == FileType.Directory:
--> 358                 raise IsADirectoryError(
    359                     'Path {} points to a directory, but only file paths are '
    360                     'supported. To construct a nested or union dataset pass '

IsADirectoryError: Path test_parquet_dataset/ points to a directory, but only file paths are supported. To construct a nested or union dataset pass a list of dataset objects instead.
```

So it also errors, although it gives a clearer error message about directories not being supported (this error message comes from an additional check that we only do if the filesystem is local, presumably because such checks can be costly for remote filesystems).

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: @galipremsagar just to be sure, can you test `pa.dataset.dataset(path[0], filesystem=fs, format="parquet")` with the fsspec filesystem? (i.e. passing `path[0]` instead of `path`)

asfimport commented 2 years ago

Prem Sagar Gali / @galipremsagar: @jorisvandenbossche Yup, that worked for me:


```python
In [7]: pa.dataset.dataset(path[0], filesystem=fs, format="parquet")
Out[7]: <pyarrow._dataset.FileSystemDataset at 0x7fb2b32300f0>
```