adampinky85 opened this issue 3 months ago
Hi, thanks for your report. We will look into those (already put up a PR to fix the list case).
For context: the Arrow FS path now leverages a rewrite of the parquet implementation that's a lot faster than the legacy implementation (a few rough edges are still expected, unfortunately).
Great, thanks! I believe glob is not supported in Arrow's S3 FS - that may be the issue for the glob case. The list case fix is much appreciated 😀
Yeah, glob patterns are not supported by Arrow FS. I don't have a solution to this yet but at the same time I'm not entirely convinced this is even necessary.
For example, instead of s3://{bucket}/{key}/*.parquet you should be able to use s3://{bucket}/{key}. Accepting a list of files should be fine, though.
Thanks, yes, the glob pattern example was trivial, just to show the exception. In our real use cases it is useful to load only a targeted subset of files, e.g., all files for a specific year, month, and various identifiers. But the list-of-paths fix is great and resolves the main issue.
Describe the issue:
Hi team,
We are intensive users of Dask and it's a great product!
We use Apache Arrow's pyarrow.fs.S3FileSystem in our ecosystem rather than s3fs.S3FileSystem, due to performance and deadlock issues we found during multiprocessing. We're able to retrieve single files and entire directories of many files successfully, but with either a glob path or an iterable of paths the read_parquet API throws maximum recursion depth exceptions. It would be really helpful if the team could investigate. Many thanks!
Minimal Verifiable Example:
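The MVE section came through empty; below is a hedged sketch of the calls described above. Bucket, key, and region values are placeholders, so this is not runnable as-is and needs real S3 credentials and paths:

```python
import dask.dataframe as dd
import pyarrow.fs

# Placeholder region -- substitute your own.
fs = pyarrow.fs.S3FileSystem(region="us-east-1")

# Works: a single file or an entire directory.
df = dd.read_parquet("s3://bucket/key", filesystem=fs)

# Reported to raise "maximum recursion depth exceeded":
df = dd.read_parquet("s3://bucket/key/*.parquet", filesystem=fs)  # glob path
df = dd.read_parquet(
    ["s3://bucket/key/a.parquet", "s3://bucket/key/b.parquet"],
    filesystem=fs,
)  # iterable of paths
```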
Exception:
Environment:
OS: Amazon Linux release 2 (Karoo)
Linux: 4.14.336-257.566.amzn2.x86_64
Python: 3.12.2

Packages:
arrow: 1.3.0
dask: 2024.3.1
dask-expr: 1.0.4
numpy: 1.26.4
pandas: 2.2.1
pyarrow: 15.0.2
pyarrow-hotfix: 0.6
Install method: pip