fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

S3 URL containing versionId is misinterpreted as glob pattern #1439

Open mdwint opened 9 months ago

mdwint commented 9 months ago

I am trying to load a specific version of an S3 object into a Dask dataframe as follows:

>>> import dask.dataframe as dd
>>> dd.read_parquet("s3://example-bucket/dataset/part.0.parquet?versionId=se654wJoRQcirhKoHRkN4hsFmhNwMg27")

This example returns an empty DataFrame, even though the S3 object version exists. If I omit ?versionId=..., it successfully loads the latest version.

According to the s3fs docs, the versionId query parameter is supported (provided that version_aware=True is set in the filesystem config).

I stepped through this in a debugger and ended up in the expand_paths_if_needed function. This function interprets the path as a glob pattern because the path contains a ? character (in ?versionId=...). The pattern matches nothing, since no such S3 keys exist, so an empty list of paths is returned.
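
For context, fsspec's glob detection uses the same magic-character check as the standard library, so any ? in a path makes it look like a pattern. A minimal sketch (the exact check inside fsspec may vary by version; the URL is the placeholder from above):

>>> from glob import has_magic
>>> url = "s3://example-bucket/dataset/part.0.parquet?versionId=se654wJoRQcirhKoHRkN4hsFmhNwMg27"
>>> has_magic(url)  # '?' is a single-character wildcard in glob syntax
True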

I'm not sure whether this needs to be fixed in fsspec or in dask. Versioned S3 objects seem to be supported by s3fs, but the higher-level fsspec doesn't expect query parameters in URLs.

Maybe this use case wasn't intended. Do you think it's feasible to support it?

martindurant commented 9 months ago

Yes, this should be supported. The trouble is that dask (in particular) expects the input to potentially refer to multiple files, and so wants to check for a glob pattern.

I believe a simple workaround may be to put your path inside a list (["s3://example-bucket/..."]) to prevent globbing.
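
Something along these lines (a sketch; the bucket, key, and versionId are placeholders, and version_aware=True is the s3fs setting mentioned above):

>>> import dask.dataframe as dd
>>> df = dd.read_parquet(
...     ["s3://example-bucket/dataset/part.0.parquet?versionId=se654wJoRQcirhKoHRkN4hsFmhNwMg27"],
...     storage_options={"version_aware": True},
... )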

mdwint commented 9 months ago

Ah, I was using a list of paths initially, and that also did not work. I then simplified the example above to one path.

EDIT: For reference, this was my debugging session:

> /Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py(518)read_parquet()
-> fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
(Pdb) c
> /Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py(361)extract_filesystem()
-> if filesystem is None:
(Pdb) c
...
> /Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py(518)read_parquet()
-> fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
(Pdb) c
...
(Pdb) p paths
['example-bucket/.../part.0.parquet?versionId=AJI11R54wutqOH.EHdMxYl7jT855cHUp', 'example-bucket/.../part.1.parquet?versionId=mCsohxpcaNPaix1jeMgOig06cPkOJ8Ec', 'example-bucket/.../part.2.parquet?versionId=se654wJoRQcirhKoHRkN4hsFmhNwMg27']
(Pdb) n
> /Users/matteo/.../.venv/lib/python3.11/site-packages/fsspec/core.py(598)get_fs_token_paths()
-> paths = expand_paths_if_needed(paths, mode, num, fs, name_function)
(Pdb) n
> /Users/matteo/.../.venv/lib/python3.11/site-packages/fsspec/core.py(607)get_fs_token_paths()
-> return fs, fs._fs_token, paths
(Pdb) p paths
[]

This was using dask==2023.3.0 and fsspec==2023.1.0. I can try testing with the latest versions too. The system I'm working on depends on these older versions for now.

martindurant commented 9 months ago

Do you mind trying with pandas alone without dask?

mdwint commented 9 months ago

Do you mind trying with pandas alone without dask?

It does work with pandas (for a single URL, since it doesn't support multiple).

>>> url
's3://example-bucket/.../part.0.parquet?versionId=AJI11R54wutqOH.EHdMxYl7jT855cHUp'
>>> pd.read_parquet(url)
             date  half_hour     value
...

[110500 rows x 3 columns]

martindurant commented 9 months ago

The fsspec.open used by pandas also calls get_fs_token_paths, but apparently in a different way. I note that this is going via arrow, so I am tempted to ask you to try with fastparquet... but it should work either way. I don't immediately see how one might distinguish a genuine glob pattern from a path plus query string, however. The list version should work, so maybe that's what I need to fix first.
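
For comparison, a sketch of the single-file route that pandas takes through fsspec.open (placeholder URL; assumes s3fs with version_aware=True, which fsspec.open passes through to the filesystem):

>>> import fsspec
>>> import pandas as pd
>>> with fsspec.open(
...     "s3://example-bucket/dataset/part.0.parquet?versionId=se654wJoRQcirhKoHRkN4hsFmhNwMg27",
...     mode="rb",
...     version_aware=True,
... ) as f:
...     df = pd.read_parquet(f)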

mdwint commented 9 months ago

I note that this is going via arrow, so I am tempted to ask you to try with fastparquet

Sure! Loading a single path using fastparquet==2023.10.1 works fine, but multiple paths don't. At some point the list of paths becomes empty (probably for the same reason as above), and we crash in dask code:

Traceback (most recent call last):
  File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/backends.py", line 133, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 543, in read_parquet
    read_metadata_result = engine.read_metadata(
                           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 902, in read_metadata
    dataset_info = cls._collect_dataset_info(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 453, in _collect_dataset_info
    paths, base, fns = _sort_and_analyze_paths(paths, fs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/utils.py", line 505, in _sort_and_analyze_paths
    base, fns = _analyze_paths(file_list, fs, root=root)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/matteo/.../.venv/lib/python3.11/site-packages/dask/dataframe/io/parquet/utils.py", line 571, in _analyze_paths
    basepath = path_parts_list[0][:-1]
               ~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

This crash could simply be because I'm on the old dask==2023.3.0, which doesn't expect an empty list here, but in any case: no paths were returned. I can try again with compatible versions of dask and fastparquet if you think that matters.

martindurant commented 5 months ago

Can you try with the current release, or the current master, where you can pass expand=False as a storage option?
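
Roughly like this (a sketch; the URL is a placeholder, and expand=False is the new option mentioned above):

>>> import dask.dataframe as dd
>>> df = dd.read_parquet(
...     "s3://example-bucket/dataset/part.0.parquet?versionId=se654wJoRQcirhKoHRkN4hsFmhNwMg27",
...     storage_options={"version_aware": True, "expand": False},
... )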

ranchodeluxe commented 4 months ago

FWIW, I'm able to read parquet files with different version identifiers on S3 without problems, using the following package versions:

s3fs==2024.3.1
fsspec==2024.3.1
dask==2024.4.0