fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
345 stars 146 forks source link

Remove race condition in credentials.py #643

Closed seeholza closed 1 month ago

seeholza commented 1 month ago

See https://github.com/fsspec/filesystem_spec/issues/565#issuecomment-2396993568

seeholza commented 1 month ago

@martindurant here is a branch with your proposed fix.

The tests here pass, but I do see weird behavior when running an actual workload. I double checked and the failure appears by simply bumping the version of gcsfs to this branch.

│   File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper                                                                                           │
│     return sync(self.loop, func, *args, **kwargs)                                                                                                                               │
│            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                               │
│   File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync                                                                                              │
│     raise return_result                                                                                                                                                         │
│   File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner                                                                                            │
│     result[0] = await coro                                                                                                                                                      │
│                 ^^^^^^^^^^                                                                                                                                                      │
│   File "/usr/local/lib/python3.11/site-packages/gcsfs/core.py", line 1030, in _ls                                                                                               │
│     for entry in await self._list_objects(                                                                                                                                      │
│                  ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                      │
│   File "/usr/local/lib/python3.11/site-packages/gcsfs/core.py", line 563, in _list_objects                                                                                      │
│     clisting = self._ls_from_cache(path)                                                                                                                                        │
│                ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                        │
│   File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 379, in _ls_from_cache                                                                                    │
│     raise FileNotFoundError(path)

The call comes from GCSFilesystem().ls(path), but the path in question exists and the command works on ~gcsfs==2024.9.0~ gcsfs==2023.12.2post1 (see below). Could be something with the authentication ends up not working?

reineking commented 1 month ago

I applied the fix in my run (writing hundreds of Parquet files with ~1b rows in total), and it works 🙂 Before the job was hanging consistently after 1h.

martindurant commented 1 month ago

I do see weird behavior when running an actual workload.

This causes the workflow to fail? Are you running multiple threads too? Missing credentials should not cause FileNotFound but PermissionError, and the traceback indicates this is coming from the directory listing cache - doing reading?

seeholza commented 1 month ago

I do see weird behavior when running an actual workload.

This causes the workflow to fail? Are you running multiple threads too? Missing credentials should not cause FileNotFound but PermissionError, and the traceback indicates this is coming from the directory listing cache - doing reading?

Yes this causes a failure. We're not running with multiple threads, it's the first (reading) call in the process, which uses GCSFilesystem().ls(path).

Edit: I am disabling caching and double checking, might be an unrelated issue on my side where i don't invalidate caches properly. It's just weird it didn't appear before.

martindurant commented 1 month ago

Yes this causes a failure. We're not running with multiple threads, it's the first (reading) call in the process, which uses GCSFilesystem().ls(path).

So I understand clearly: ls (alone and with no other operations) works with current main, but not with this patch?

seeholza commented 1 month ago

@martindurant I did notice that pre-patch I accidentally ran gcsfs==2023.12.2.post1, which didn't contain https://github.com/fsspec/gcsfs/pull/612. I checked, and updating to 2024.9.0 causes the same ls failure already, so this is unrelated. Sorry for the red herring. I now call ls(path, refresh=True) and this works as intended. Preliminary runs confirm that I do not get the race condition/hanging anymore.

So, as to this bugfix, I think we're good, opening the PR for review.


The following is therefore unrelated to the PR, but just to answer your question @martindurant :

So I understand clearly: ls (alone and with no other operations) works with current main, but not with this patch?

What I do exactly (in pseudocode) is:

So, if the cache persists somehow between instantiations of GCSFileSystem, then that could be an explanation for the failure of the ls.

martindurant commented 1 month ago

gcsfs.GCSFileSystem() will get the same instance each time, by design, since creating credentials and connections is expensive. It will also persist the directory cache. You can avoid this by passing skip_instance_cache=True, but it would be worth finding out what's going on.

extra code creates new paths

If this is from a different instance/process/machine, then it is to be expected that the directory cache of the instance above has become stale, so refresh is the right thing to do (or .invalidate_cache()).

seeholza commented 1 month ago

gcsfs.GCSFileSystem() will get the same instance each time, by design, since creating credentials and connections is expensive. It will also persist the directory cache. You can avoid this by passing skip_instance_cache=True, but it would be worth finding out what's going on.

Thats very useful information @martindurant, thank you very much! I was trying to find something along these lines in the documentation, but failed to. If this is not in there yet, would it make sense adding it?

martindurant commented 1 month ago

It is an inherited behaviour from fsspec, mentioned here: https://filesystem-spec.readthedocs.io/en/latest/features.html#instance-caching

I totally agree that the whole project and set of repos could do with a docs reorganise, but I think it's beyond my personal effort level right now.