Closed: seeholza closed this 1 month ago
@martindurant here is a branch with your proposed fix.
The tests here pass, but I do see weird behavior when running an actual workload. I double-checked, and the failure appears simply from bumping the gcsfs version to this branch.
```
  File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/gcsfs/core.py", line 1030, in _ls
    for entry in await self._list_objects(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/gcsfs/core.py", line 563, in _list_objects
    clisting = self._ls_from_cache(path)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 379, in _ls_from_cache
    raise FileNotFoundError(path)
```
The call comes from GCSFileSystem().ls(path), but the path in question exists, and the command works on ~~gcsfs==2024.9.0~~ gcsfs==2023.12.2.post1 (see below). Could it be that the authentication ends up not working?
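Reduced to a minimal sketch, the failing call is just this (the bucket path is a placeholder):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()
fs.ls("my-bucket/existing/path")  # raises FileNotFoundError on this branch
```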
I applied the fix in my run (writing hundreds of Parquet files with ~1B rows in total), and it works 🙂 Before, the job was hanging consistently after ~1h.
> I do see weird behavior when running an actual workload.

This causes the workflow to fail? Are you running multiple threads too? Missing credentials should not cause FileNotFoundError but PermissionError, and the traceback indicates this is coming from the directory listing cache. Is this during reading?
Yes, this causes a failure. We're not running with multiple threads; it's the first (reading) call in the process, which uses GCSFileSystem().ls(path).

Edit: I am disabling caching and double-checking; it might be an unrelated issue on my side where I don't invalidate caches properly. It's just weird that it didn't appear before.
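For completeness, by "disabling caching" I mean something like the following (the bucket path is a placeholder; invalidate_cache comes from fsspec's AbstractFileSystem):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()

# drop all cached directory listings (or pass a path to drop one subtree)
fs.invalidate_cache()
fs.ls("my-bucket/path")  # now forced to hit the GCS API
```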
> Yes, this causes a failure. We're not running with multiple threads; it's the first (reading) call in the process, which uses GCSFileSystem().ls(path).
So I understand clearly: ls (alone and with no other operations) works with current main, but not with this patch?
@martindurant I did notice that pre-patch I accidentally ran gcsfs==2023.12.2.post1, which didn't contain https://github.com/fsspec/gcsfs/pull/612. I checked, and updating to 2024.9.0 already causes the same ls failure, so this is unrelated. Sorry for the red herring. I now call ls(path, refresh=True) and this works as intended. Preliminary runs confirm that I no longer get the race condition/hanging.
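Concretely, the call that now works looks like this (the bucket path is a placeholder; refresh=True makes gcsfs re-list from the API instead of serving its directory cache):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()

# bypass the instance's directory cache and list straight from GCS
entries = fs.ls("my-bucket/path/subdir1/subdir2", refresh=True)
```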
So, as to this bugfix, I think we're good, opening the PR for review.
The following is therefore unrelated to the PR, but just to answer your question @martindurant :
> So I understand clearly: ls (alone and with no other operations) works with current main, but not with this patch?
What I do exactly (in pseudocode) is the following; a runnable version is sketched after the list:

1. gcsfs.GCSFileSystem().exists(parent_path) is called earlier
2. extra code creates new paths path/subdir1/subdir2, with content in subdir1 and subdir2
3. gcsfs.GCSFileSystem().ls(path/subdir1/subdir2) is called later, and this is the one that fails, only after the patch.

So, if the cache persists somehow between instantiations of GCSFileSystem, then that could be an explanation for the failure of the ls.
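A runnable version of that sequence, with placeholder bucket and file names (the real workload writes Parquet files from separate processes; a dummy object stands in here):

```python
import gcsfs

parent_path = "my-bucket/data"   # placeholder
path = "my-bucket/data/run"      # placeholder

# 1. early existence check; this also populates the directory cache
fs = gcsfs.GCSFileSystem()
fs.exists(parent_path)

# 2. extra code creates new paths (in my real run this happens in a
#    different process, so the instance above keeps a stale dircache)
fs.pipe_file(f"{path}/subdir1/subdir2/part-0.parquet", b"...")

# 3. later listing of the new path; gcsfs.GCSFileSystem() returns the
#    cached instance, so without refresh=True this can serve the stale
#    directory cache and raise FileNotFoundError
fs2 = gcsfs.GCSFileSystem()
fs2.ls(f"{path}/subdir1/subdir2", refresh=True)
```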
gcsfs.GCSFileSystem() will get the same instance each time, by design, since creating credentials and connections is expensive. It will also persist the directory cache. You can avoid this by passing skip_instance_cache=True, but it would be worth finding out what's going on.
> extra code creates new paths
If this is from a different instance/process/machine, then it is to be expected that the directory cache of the instance above has become stale, so refresh is the right thing to do (or .invalidate_cache()).
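To illustrate the instance caching (a minimal sketch; no objects are read or written):

```python
import gcsfs

a = gcsfs.GCSFileSystem()
b = gcsfs.GCSFileSystem()
assert a is b  # same cached instance, sharing one directory cache

# opt out of the instance cache to get a fresh instance and dircache
c = gcsfs.GCSFileSystem(skip_instance_cache=True)
assert c is not a
```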
> gcsfs.GCSFileSystem() will get the same instance each time, by design, since creating credentials and connections is expensive. It will also persist the directory cache. You can avoid this by passing skip_instance_cache=True, but it would be worth finding out what's going on.
That's very useful information, @martindurant, thank you very much! I was trying to find something along these lines in the documentation, but failed to. If this is not in there yet, would it make sense to add it?
It is an inherited behaviour from fsspec, mentioned here: https://filesystem-spec.readthedocs.io/en/latest/features.html#instance-caching
I totally agree that the whole project and set of repos could do with a docs reorganise, but I think it's beyond my personal effort level right now.
See https://github.com/fsspec/filesystem_spec/issues/565#issuecomment-2396993568