Closed martindurant closed 3 months ago
cc @slevang @mil-ad - this can't really be tested in CI
I did some cleanup here, if people wouldn't mind trying again.
Thanks @martindurant ! I'll try reproducing tomorrow
On the latest commit I'm getting failures like this now:
File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/zarr/storage.py", line 1531, in rmdir
self.fs.rm(store_path, recursive=True)
File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
raise return_result
File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
^^^^^^^^^^
File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/gcsfs/core.py", line 1286, in _rm
raise errors[0]
OSError: { "code": 503, "message": "Server is not responding", "errors": [ { "message": "Server is not responding", "domain": "global", "reason": "backendError" } ] }
On the previous commit, today I am also getting 503 instead of 429, but the failed files are correctly filtered out and retried.
I made a small change to batch the leftovers, if any, instead of repeating for each batch.
I don't see how 503 is escaping, unless you are running out of retries (could add logging here):
```python
if code in [200, 204]:
    out.append(path)
elif code in errs and i < 5:
    remaining.append(path)
else:
    ...
```
should mean a 503 gets the path back into `remaining` and is retried, so long as retries are left. Maybe the server really was flaky? `errs` contains [500, 501, 502, 503, 504, 408, 429].
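For reference, a minimal, self-contained sketch of the retry/filter flow described above. The `delete_batch` callable and the `delete_with_retry` name are stand-ins for illustration, not the real gcsfs internals:

```python
# Status codes treated as retriable, per the discussion above.
RETRIABLE = [500, 501, 502, 503, 504, 408, 429]

def delete_with_retry(paths, delete_batch, max_retries=5):
    """Delete paths, retrying only those that failed with a retriable code.

    `delete_batch` is a hypothetical callable returning
    [(path, status_code), ...] for one batch request.
    """
    done = []
    remaining = list(paths)
    for attempt in range(max_retries):
        if not remaining:
            break
        results = delete_batch(remaining)
        retry = []
        for path, code in results:
            if code in (200, 204):
                done.append(path)
            elif code in RETRIABLE and attempt < max_retries - 1:
                # Failed with a retriable code: feed it back into the
                # next round instead of raising immediately.
                retry.append(path)
            else:
                raise OSError(f"delete failed for {path!r} with {code}")
        remaining = retry
    return done
```

If a 503 were somehow not matched against the retriable list, it would fall into the `else` branch and raise on the first attempt, which is consistent with the behavior observed.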
Of course, now that I try again I can't trigger any errors (regardless of commit). I'll try again tomorrow.
When I hit the 503 earlier, I did have a print statement that should have logged if anything went into `remaining`, but I never saw it, suggesting the 503 was somehow not being flagged as retriable.
On batch size, Google says this:
You should not include more than 100 calls in a single batch request. If you need to make more calls than that, use multiple batch requests. The total batch request payload must be less than 10MB.
I was thinking that, with the requests being concurrent, smaller batches would be better. But maybe that's wrong, since sending the requests should consume about the same bandwidth regardless of batch size.
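Respecting the 100-call limit just means chunking the path list before dispatch; a trivial sketch (the `chunks` helper is hypothetical, not part of gcsfs):

```python
def chunks(paths, batchsize=100):
    """Yield successive slices of at most `batchsize` paths,
    honoring Google's 100-calls-per-batch-request limit."""
    for i in range(0, len(paths), batchsize):
        yield paths[i:i + batchsize]
```

The open question above is then only how many such batches to keep in flight concurrently, not how to split them.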
We have two batch sizes here: in the outer _rm and the inner _rm_files. Should they be the same?
I'm not sure, but that seems OK. The outer _rm then ends up handling all the batching, and each call to _rm_files only processes a single chunk, so the code could actually be simplified by dropping the loop from _rm_files, right?
Also, should a user be able to configure batchsize from the public API? AbstractFileSystem.rm() doesn't pass through additional **kwargs.
Actually, it's this one: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py#L319
And you are right: if _rm_files is not designed to be called from outside, there's no reason for it to do its own internal looping.
Thanks!
Fixes #558