fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

retry bulk rm #608

Closed by martindurant 3 months ago

martindurant commented 4 months ago

Fixes #558

martindurant commented 4 months ago

cc @slevang @mil-ad - this can't really be tested in CI

martindurant commented 4 months ago

I did some cleanup here, if people wouldn't mind trying again.

mil-ad commented 4 months ago

Thanks @martindurant! I'll try reproducing tomorrow.

slevang commented 4 months ago

On the latest commit I'm getting failures like this now:

  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/zarr/storage.py", line 1531, in rmdir
    self.fs.rm(store_path, recursive=True)
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/opt/miniconda3/envs/salient/lib/python3.11/site-packages/gcsfs/core.py", line 1286, in _rm
    raise errors[0]
OSError: {
    "code": 503,
    "message": "Server is not responding",
    "errors": [
        {
            "message": "Server is not responding",
            "domain": "global",
            "reason": "backendError"
        }
    ]
}

When I use the previous commit, today I am also getting 503 instead of 429, but the failed files are correctly filtered out and retried.

martindurant commented 4 months ago

I made a small change to batch the leftovers, if any, instead of repeating for each batch.

I don't see how 503 is escaping, unless you are running out of retries (could add logging here):

                    if code in [200, 204]:
                        out.append(path)  # deleted successfully
                    elif code in errs and i < 5:
                        remaining.append(path)  # retriable status; try again
                    else:
                        ...

should mean a 503 gets the path put back into remaining and retried, so long as retries are left. Maybe the server really was flaky? errs contains [500, 501, 502, 503, 504, 408, 429].
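
For clarity, here is the shape of the retry loop I mean, as a stand-alone sketch; the delete_batch callable and the function name are illustrative, not the actual gcsfs code:

    errs = {500, 501, 502, 503, 504, 408, 429}  # retriable status codes

    async def rm_with_retries(paths, delete_batch, attempts=5):
        # delete_batch(paths) -> list of (path, status_code, error) tuples,
        # standing in for the real batched DELETE request.
        deleted, errors = [], []
        remaining = list(paths)
        for i in range(attempts):
            if not remaining:
                break
            results = await delete_batch(remaining)
            retry = []
            for path, code, err in results:
                if code in (200, 204):
                    deleted.append(path)        # success
                elif code in errs and i < attempts - 1:
                    retry.append(path)          # retriable; go around again
                else:
                    errors.append(err)          # give up on this path
            remaining = retry
        if errors:
            raise errors[0]
        return deleted

The point is that a retriable code only escapes once the attempt budget is exhausted.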

slevang commented 4 months ago

Of course I can't trip any errors now that I try again (regardless of commit). I'll try again tomorrow.

When I hit the 503 earlier, I did have a print statement that should have logged if anything went into remaining, but I never saw it fire, suggesting the failure was not getting flagged as retriable somehow.
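
For next time, I'll switch the print for proper debug logging so the retry branch is easier to see; a minimal sketch (the logger name and the message are my assumptions, not existing gcsfs output):

    import logging

    # In the reproducing script, so DEBUG messages actually show up:
    logging.basicConfig(level=logging.DEBUG)

    # And in the retry branch in gcsfs/core.py, in place of the print:
    logger = logging.getLogger("gcsfs")
    # logger.debug("retrying %d paths after retriable errors", len(remaining))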

slevang commented 4 months ago

On batch size, Google says this:

You should not include more than 100 calls in a single batch request. If you need to make more calls than that, use multiple batch requests. The total batch request payload must be less than 10MB.
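
So whatever batch size we choose internally, the paths need to be split into groups of at most 100 before each batch request; something like this (the helper name is mine):

    def chunks(paths, size=100):
        # At most 100 calls per batch request, per the GCS docs quoted above;
        # the total payload must also stay under 10MB.
        for i in range(0, len(paths), size):
            yield paths[i : i + size]

    # each chunk then becomes one batched DELETE request:
    # for batch in chunks(all_paths): ...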

martindurant commented 4 months ago

I was thinking that, with the requests being concurrent, smaller batches would be better. Maybe that's wrong, since sending the requests should take about the same bandwidth regardless.

We have two batch sizes here: in the outer _rm and the inner _rm_files. Should they be the same?

slevang commented 4 months ago

I'm not sure, but that seems ok. The outer _rm then ends up handling all the batching and each call to _rm_files only runs a single chunk, so the code could actually be simplified to drop a loop from _rm_files, right?

Also, should a user be able to configure batchsize from the public API? AbstractFileSystem.rm() doesn't pass through additional **kwargs.

martindurant commented 4 months ago

Actually, it's this one: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py#L319

And you are right: if _rm_files is not designed to be called from outside, there's no reason for it to do its own internal looping.
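
For illustration, here is roughly how a user could influence the batch size today, assuming the config key and the batch_size keyword in fsspec/asyn.py are as I remember them (worth double-checking against your installed version):

    import fsspec
    from fsspec.config import conf

    # Globally, via fsspec's configuration (key name assumed from
    # _get_batch_size in fsspec/asyn.py; it may differ between versions):
    conf["gather_batch_size"] = 100

    fs = fsspec.filesystem("gcs")
    # Per call: the sync rm() mirrored from AsyncFileSystem._rm forwards
    # extra keyword arguments, so an explicit batch size may also work:
    # fs.rm("bucket/prefix", recursive=True, batch_size=100)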

slevang commented 3 months ago

Thanks!