fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

FileNotFoundError since 2024.3.1 #616

Closed shobsi closed 3 months ago

shobsi commented 3 months ago

https://github.com/googleapis/python-bigquery-dataframes has a dependency on fsspec. Our tests have been failing since version 2024.3.1.

This is what the initial environment looks like:

(venv1) $ pip freeze | grep "pandas\|google-cloud-storage\|gcsfs\|fsspec"
fsspec==2024.2.0
gcsfs==2024.2.0
google-cloud-storage==2.16.0
pandas==2.2.1

pandas.read_csv works as expected:

(venv1) $ python -c 'import pandas as pd; print(pd.read_csv("gs://bigframes-dev-testing/bigframes_tests_system_20240319012411_f8ab6/test_to_csv_tabs*.csv", sep="\\t"))'
<string>:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
   rowindex bool_col                 bytes_col    date_col         datetime_col                               geography_col    int64_col  int64_too   numeric_col   float64_col  rowindex_2         string_col         time_col                   timestamp_col
0         0     True      SGVsbG8sIFdvcmxkIQ==  2021-07-21  2021-07-21 11:39:45              POINT(-122.0838511 37.3860517)  123456789.0          0  1.234568e+00  1.250000e+00           0      Hello, World!  11:41:43.076160  2021-07-21 17:43:43.945289 UTC
1         1    False      44GT44KT44Gr44Gh44Gv  1991-02-03  1991-01-02 03:45:06                       POINT(-71.104 42.315) -987654321.0          1  1.234568e+00  2.510000e+00           1              こんにちは  11:14:34.701606  2021-07-21 17:43:43.945289 UTC
2         2     True      wqFIb2xhIE11bmRvIQ==  2023-03-01  2023-03-01 10:55:13  POINT(-0.124474760143016 51.5007826749545)     314159.0          0  1.011010e+02  2.500000e+10           2     ¡Hola Mundo!    23:59:59.999999  2023-03-01 10:55:13.250125 UTC
3         3      NaN                       NaN         NaN                  NaN                                         NaN          NaN          1           NaN           NaN           3               None             None                            None
4         4    False      44GT44KT44Gr44Gh44Gv  2021-07-21                  NaN                                         NaN    -234892.0      -2345           NaN           NaN           4      Hello, World!             None                            None
5         5    False          R8O8dGVuIFRhZw==  1980-03-14  1980-03-14 15:16:17                                         NaN      55555.0          0  5.555555e+00  5.555550e+02           5         Güten Tag!  15:16:17.181921  1980-03-14 15:16:17.181921 UTC
6         6     True  SGVsbG8JQmlnRnJhbWVzIQc=  2023-05-23  2023-05-23 11:37:01      MULTIPOINT(20 20, 10 40, 40 30, 30 10)  101202303.0          2 -1.009081e+01 -1.234560e+02           6  capitalize, This   01:02:03.456789  2023-05-23 11:42:55.000001 UTC
7         7     True                       NaN  2038-01-20  2038-01-19 03:14:08                                         NaN -214748367.0          2  1.111111e+07  4.242000e+01           7               سلام  12:00:00.000001  2038-01-19 03:14:17.999999 UTC
8         8    False                       NaN         NaN                  NaN                                         NaN          2.0          1           NaN  6.870000e+00           8                  T             None                            None

Install gcsfs version 2024.3.1:

(venv1) $ pip install "gcsfs==2024.3.1" -q
(venv1) $ pip freeze | grep "pandas\|google-cloud-storage\|gcsfs\|fsspec"
fsspec==2024.3.1
gcsfs==2024.3.1
google-cloud-storage==2.16.0
pandas==2.2.1

Rerun the same command; it now fails:

(venv1) $ python -c 'import pandas as pd; print(pd.read_csv("gs://bigframes-dev-testing/bigframes_tests_system_20240319012411_f8ab6/test_to_csv_tabs*.csv", sep="\\t"))'
<string>:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds) 
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/pandas/io/common.py", line 728, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/pandas/io/common.py", line 432, in _get_filepath_or_buffer
    ).open()
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode) 
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 1581, in _open
    return GCSFile(
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 1746, in __init__
    super().__init__(
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/fsspec/spec.py", line 1651, in __init__
    self.size = self.details["size"]
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 1782, in details
    self._details = self.fs.info(self.path, generation=self.generation)
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 999, in _info
    out = await self._ls(path, **kwargs)   
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 1028, in _ls
    for entry in await self._list_objects( 
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 589, in _list_objects
    return [await self._get_object(path)]  
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 525, in _get_object
    res = await self._call(
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 445, in _call
    status, headers, info, contents = await self._request(
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/decorator.py", line 221, in fun
    return await caller(func, *(extras + args), **kw)
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/retry.py", line 126, in retry_request
    return await func(*args, **kwargs)
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/core.py", line 438, in _request
    validate_response(status, contents, path, args)
  File "/usr/local/google/home/shobs/code/bigframes/venv1/lib/python3.10/site-packages/gcsfs/retry.py", line 95, in validate_response
    raise FileNotFoundError(path)
FileNotFoundError: b/bigframes-dev-testing/o/bigframes_tests_system_20240319012411_f8ab6%2Ftest_to_csv_tabs%2A.csv
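
The path in the error message suggests that the glob pattern was no longer expanded: the `*` appears percent-encoded as `%2A`, as if it were part of a literal object name. The sketch below reproduces that encoding with the standard library only. `object_url_path` is a hypothetical helper written for illustration, not gcsfs code; it assumes the object name is percent-encoded with no characters left unescaped.

```python
from urllib.parse import quote


def object_url_path(bucket: str, key: str) -> str:
    """Build a 'b/<bucket>/o/<object>' style path (hypothetical helper).

    The object name is percent-encoded with no "safe" characters, so
    '/' becomes %2F and a literal '*' becomes %2A.
    """
    return f"b/{bucket}/o/{quote(key, safe='')}"


encoded = object_url_path(
    "bigframes-dev-testing",
    "bigframes_tests_system_20240319012411_f8ab6/test_to_csv_tabs*.csv",
)
print(encoded)
```

The printed path has the same shape as the one in the FileNotFoundError above, which is consistent with the storage API being asked for an object whose name literally contains `*`.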

martindurant commented 3 months ago

duplicate: https://github.com/fsspec/s3fs/issues/862

https://github.com/fsspec/filesystem_spec/pull/1551 lets you pass expand=True to force open() to find the first matching file when it is called with a glob string and only one file is expected. This matches the old behaviour, which was unintended.

@Skylion007, it seems like your use case of paths you do NOT want expanded may be in the minority. I may change the default value of expand= to match the previous code path, and then in your code you would need to pass expand=False explicitly. Thoughts?
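
To make the two behaviours concrete, here is a minimal sketch on the local filesystem using only the standard library. `open_path` is a hypothetical stand-in written for this illustration, not fsspec's implementation: with expand=True it treats the path as a glob and opens the first match (the old, unintended behaviour); with expand=False it passes the path through literally.

```python
import glob
import os
import tempfile


def open_path(path, mode="r", expand=False):
    """Open `path`; with expand=True, treat it as a glob and open the first match."""
    if expand:
        matches = sorted(glob.glob(path))
        if not matches:
            raise FileNotFoundError(path)
        path = matches[0]
    return open(path, mode)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Two files that both match the pattern used below.
        for name in ("test_to_csv_tabs0.csv", "test_to_csv_tabs1.csv"):
            with open(os.path.join(tmp, name), "w") as f:
                f.write(name)
        pattern = os.path.join(tmp, "test_to_csv_tabs*.csv")

        # Expanded: the first match (in sorted order) is opened.
        with open_path(pattern, expand=True) as f:
            expanded = f.read()

        # Literal: there is no file whose name contains '*', so this fails.
        try:
            open_path(pattern, expand=False).close()
            literal_ok = True
        except OSError:  # FileNotFoundError is a subclass of OSError
            literal_ok = False
    return expanded, literal_ok


expanded, literal_ok = demo()
```

Here `expanded` holds the content of the first matching file, while the literal open fails, which mirrors the FileNotFoundError reported in this issue.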

cofin commented 3 months ago

Just to add another datapoint to this. I have been impacted by this change when reading within duckdb. A way to globally revert to the previous behavior would be greatly appreciated.

Skylion007 commented 3 months ago

This is actually a major footgun in pandas right now. If I understand correctly, this glob handling behaves differently on local filesystems and with fsspec.open (or at least with other libraries like dask). The previous behavior was undocumented and a bug; it also meant you simply couldn't open certain files. I know this is a breaking change, but it breaks behavior that really shouldn't be supported. fsspec.open is supposed to be mostly a drop-in replacement for builtins.open, and therefore they should share similar semantics.

If pandas wants to support this by default, they are welcome to change their use of the APIs to opt in to this behavior.

martindurant commented 3 months ago

I think it's fair to say that pandas will make no changes on our behalf; it's up to us to decide what the most expected and most useful behaviours are.

Skylion007 commented 3 months ago

Yeah, I would say the special glob characters are actually quite common in file names; file[train].csv is a common convention, for instance.
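
The file[train].csv case can be checked with the standard library on a local filesystem. In this sketch, builtins.open takes the name literally, while glob reads `[train]` as a character class, so the pattern does not match the file of the same name unless it is escaped with glob.escape. The file name and contents are made up for the illustration.

```python
import glob
import os
import tempfile


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file[train].csv")

        # builtins.open treats the brackets as ordinary characters.
        with open(path, "w") as f:
            f.write("a,b\n1,2\n")

        # As a glob pattern, "[train]" means "one of t, r, a, i, n",
        # so the pattern does not match the literal file name.
        as_pattern = glob.glob(path)

        # Escaping the special characters makes the match literal again.
        escaped = [os.path.basename(p) for p in glob.glob(glob.escape(path))]
    return as_pattern, escaped


as_pattern, escaped = demo()
```

With glob expansion always on, such a file could not be opened by its own name, which is the situation described above.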