mdwint opened 9 months ago
Yes, this should be supported. The trouble is that dask (in particular) expects the possibility of multiple files in the input, and so wants to check for the possibility of a glob pattern.
I believe a simple workaround may be to put your path inside a list (["s3://example-bucket/..."]) to prevent globbing.
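For instance, a sketch with a placeholder path (version_aware here is an assumption, taken from s3fs's versioned-read support):

import dask.dataframe as dd

# Passing a list rather than a bare string is intended to skip the
# glob-expansion step that would misread "?versionId=..." as a pattern.
df = dd.read_parquet(
    ["s3://example-bucket/data/part.0.parquet?versionId=..."],
    storage_options={"version_aware": True},
)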
Ah, I was using a list of paths initially, and that also did not work. I then simplified the example above to one path.
EDIT: For reference, this was my debugging session:
> /Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py(518)read_parquet()
-> fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
(Pdb) c
> /Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py(361)extract_filesystem()
-> if filesystem is None:
(Pdb) c
...
> /Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py(518)read_parquet()
-> fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
(Pdb) c
...
(Pdb) p paths
['example-bucket/.../part.0.parquet?versionId=AJI11R54wutqOH.EHdMxYl7jT855cHUp', 'example-bucket/.../part.1.parquet?versionId=mCsohxpcaNPaix1jeMgOig06cPkOJ8Ec', 'example-bucket/.../part.2.parquet?versionId=se654wJoRQcirhKoHRkN4hsFmhNwMg27']
(Pdb) n
> /Users/matteo/.../.venv/lib/python3.11/site-packages/fsspec/core.py(598)get_fs_token_paths()
-> paths = expand_paths_if_needed(paths, mode, num, fs, name_function)
(Pdb) n
> /Users/matteo/.../.venv/lib/python3.11/site-packages/fsspec/core.py(607)get_fs_token_paths()
-> return fs, fs._fs_token, paths
(Pdb) p paths
[]
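For what it's worth, the ? seems to be what trips the glob check; the stdlib's (undocumented) glob.has_magic helper makes the same classification:

import glob

# "?" is a single-character wildcard in glob syntax, so a URL carrying
# a "?versionId=..." query string looks like a pattern to be expanded.
print(glob.has_magic("part.0.parquet?versionId=abc123"))  # True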
This was using dask==2023.3.0 and fsspec==2023.1.0. I can try testing with the latest versions too. The system I'm working on depends on these older versions for now.
Do you mind trying with pandas alone without dask?
It does work with pandas (for a single URL, since it doesn't support multiple).
>>> url
's3://example-bucket/.../part.0.parquet?versionId=AJI11R54wutqOH.EHdMxYl7jT855cHUp'
>>> pd.read_parquet(url)
date half_hour value
...
[110500 rows x 3 columns]
The fsspec.open used by pandas also calls get_fs_token_paths, but apparently differently. I note that this is going via arrow, so I am tempted to ask you to try with fastparquet... but it should work anyway. I don't immediately see how one might distinguish a genuine glob pattern from a path plus query string, however. The list version should work, so maybe that's what I need to fix first.
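For comparison, a sketch of the single-file route pandas takes (placeholder path; keyword arguments like version_aware are forwarded to s3fs):

import fsspec

# A single fsspec.open succeeds here (as the REPL session above shows),
# even though the multi-path expansion returns an empty list.
with fsspec.open(
    "s3://example-bucket/data/part.0.parquet?versionId=...",
    mode="rb",
    version_aware=True,
) as f:
    print(f.read(4))  # parquet files begin with b"PAR1"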
Sure! Loading a single path using fastparquet==2023.10.1 works fine, but multiple paths don't. At some point the list of paths becomes empty (probably for the same reason as above), and we crash in dask code:
Traceback (most recent call last):
File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/backends.py", line 133, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 543, in read_parquet
read_metadata_result = engine.read_metadata(
^^^^^^^^^^^^^^^^^^^^^
File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 902, in read_metadata
dataset_info = cls._collect_dataset_info(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 453, in _collect_dataset_info
paths, base, fns = _sort_and_analyze_paths(paths, fs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/utils.py", line 505, in _sort_and_analyze_paths
base, fns = _analyze_paths(file_list, fs, root=root)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/utils.py", line 571, in _analyze_paths
basepath = path_parts_list[0][:-1]
~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
This crash could be because I'm on the old dask==2023.3.0 and it doesn't expect an empty list here, but in any case: no paths were returned. I can try again with compatible versions of dask and fastparquet if you think that matters.
Can you try with the current release, or the current master, where you can pass expand=False as a storage_option?
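As a sketch, assuming expand is picked up from storage_options as described (placeholder path):

import dask.dataframe as dd

# expand=False is meant to disable glob expansion of the input path;
# version_aware=True is still needed for s3fs to honor versionId.
df = dd.read_parquet(
    "s3://example-bucket/data/part.0.parquet?versionId=...",
    storage_options={"version_aware": True, "expand": False},
)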
FWIW, I'm able to read parquet files with different version identifiers on s3 without problems, using the following package versions: s3fs==2024.3.1, fsspec==2024.3.1, dask==2024.4.0.
I am trying to load a specific version of an S3 object into a Dask dataframe as follows:
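Roughly like this (the exact snippet isn't reproduced here; bucket, key, and version ID are placeholders):

import dask.dataframe as dd

# version_aware=True is set so that s3fs honors the versionId query
# parameter on the object path.
df = dd.read_parquet(
    "s3://example-bucket/data/part.0.parquet?versionId=...",
    storage_options={"version_aware": True},
)
print(len(df))  # 0, even though this object version exists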
This example returns an empty DataFrame, even though the S3 object version exists. If I omit ?versionId=..., it successfully loads the latest version.
According to the s3fs docs, the versionId query parameter is supported (given that version_aware=True is set in the filesystem config).
I stepped through this in a debugger, and ended up in the expand_paths_if_needed function. This function interprets the path as a glob pattern because it contains a ? character (in ?versionId=...). This glob pattern matches nothing, since no such S3 keys exist, and an empty list of paths is returned.
I'm not sure if this needs to be fixed in fsspec or dask. Versioned S3 objects seem to be supported by s3fs, but the higher-level fsspec doesn't expect query parameters in URLs. Maybe this use case wasn't intended. Do you think it's feasible to support it?
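For reference, a sketch of the lower-level s3fs route that does handle versions (placeholder path; the version_id keyword is taken from the s3fs docs):

import s3fs

# A version-aware filesystem can address a specific object version.
fs = s3fs.S3FileSystem(version_aware=True)
with fs.open("example-bucket/data/part.0.parquet", version_id="...") as f:
    print(f.read(4))  # b"PAR1"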