asfimport opened this issue (status: Open)
Prem Sagar Gali / @galipremsagar: From the stack trace, this issue seems similar to https://issues.apache.org/jira/browse/ARROW-10923, but ARROW-10923 doesn't have a reproducer.
Antoine Pitrou / @pitrou: Arrow has its own S3 filesystem that gets used when you pass the URI as a string, but if you pass an fsspec filesystem instance, a compatibility layer is used instead, and it might have bugs (and/or fsspec changed its semantics slightly).
Antoine Pitrou / @pitrou: cc @jorisvandenbossche for the fsspec compatibility issue.
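To illustrate the distinction, here is a minimal sketch of the two code paths (the bucket and prefix names below are placeholders, not taken from this report):

import pyarrow.dataset as ds
import s3fs

# 1. URI string: pyarrow parses the "s3://" scheme and uses its own
#    native S3FileSystem implementation.
native = ds.dataset("s3://my-bucket/my-prefix/", format="parquet")

# 2. fsspec filesystem instance: pyarrow wraps it in the fsspec
#    compatibility layer (PyFileSystem + FSSpecHandler).
fs = s3fs.S3FileSystem()
wrapped = ds.dataset("my-bucket/my-prefix/", filesystem=fs, format="parquet")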
Joris Van den Bossche / @jorisvandenbossche: It might be that this is not related to passing a native vs fsspec filesystem, but just that if you pass a list of strings, we assume that it is a list of files, and not a directory.
Joris Van den Bossche / @jorisvandenbossche: So reproducing this locally with a pyarrow local filesystem:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
from pyarrow.fs import LocalFileSystem

table = pa.table({'a': [1, 2, 3]})
pq.write_to_dataset(table, "test_parquet_dataset/")

In [9]: ds.dataset(["test_parquet_dataset/"], format="parquet", filesystem=LocalFileSystem())
---------------------------------------------------------------------------
IsADirectoryError Traceback (most recent call last)
<ipython-input-9-8e554a28b381> in <module>
----> 1 ds.dataset(["test_parquet_dataset/"], format="parquet", filesystem=LocalFileSystem())
~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
695 elif isinstance(source, (tuple, list)):
696 if all(_is_path_like(elem) for elem in source):
--> 697 return _filesystem_dataset(source, **kwargs)
698 elif all(isinstance(elem, Dataset) for elem in source):
699 return _union_dataset(source, **kwargs)
~/scipy/repos/arrow/python/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
435
436 if isinstance(source, (list, tuple)):
--> 437 fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
438 else:
439 fs, paths_or_selector = _ensure_single_source(source, filesystem)
~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_multiple_sources(paths, filesystem)
356 raise FileNotFoundError(info.path)
357 elif file_type == FileType.Directory:
--> 358 raise IsADirectoryError(
359 'Path {} points to a directory, but only file paths are '
360 'supported. To construct a nested or union dataset pass '
IsADirectoryError: Path test_parquet_dataset/ points to a directory, but only file paths are supported. To construct a nested or union dataset pass a list of dataset objects instead.
So it also errors, although it gives a clearer error message about a directory not being supported (this error message comes from an additional check that we only do if the filesystem is local, I suppose because those checks can be costly for remote filesystems).
Joris Van den Bossche / @jorisvandenbossche: @galipremsagar just to be sure, can you test pa.dataset.dataset(path[0], filesystem=fs, format="parquet") with the fsspec filesystem? (so passing path[0] instead of path)
Prem Sagar Gali / @galipremsagar: @jorisvandenbossche Yup, that worked for me:
In [7]: pa.dataset.dataset(path[0], filesystem=fs, format="parquet")
Out[7]: <pyarrow._dataset.FileSystemDataset at 0x7fb2b32300f0>
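For readers hitting the same error, a minimal sketch of the patterns that do work with an fsspec filesystem (the bucket and file names below are placeholders):

import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()

# Pass a directory as a single string, not wrapped in a list ...
dataset = ds.dataset("my-bucket/my-prefix", filesystem=fs, format="parquet")

# ... or pass a list of explicit file paths (a list must contain only files).
dataset = ds.dataset(
    ["my-bucket/my-prefix/part-0.parquet", "my-bucket/my-prefix/part-1.parquet"],
    filesystem=fs,
    format="parquet",
)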
When an s3 filesystem is passed as the filesystem argument to the pyarrow.dataset.dataset API and the source is a directory name with a bucket, there is an error.

Reporter: Prem Sagar Gali / @galipremsagar
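A minimal sketch of the reported failure mode (the bucket and prefix are placeholders):

import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()
path = ["my-bucket/my-prefix/"]  # a directory path wrapped in a list

# Raises an error: a list of strings is assumed to contain file paths, not directories.
dataset = ds.dataset(path, filesystem=fs, format="parquet")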
Note: This issue was originally created as ARROW-16438. Please see the migration documentation for further details.