dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

Parallelize removal of extra files in Zarr #1411

Open yarikoptic opened 8 months ago

yarikoptic commented 8 months ago

related

jwodder commented 8 months ago

The client already batches Zarr entry deletions, 100 entries per request. I doubt doing multiple batches in parallel is going to result in faster turnaround from the server.
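
For reference, the batching is conceptually along these lines (a simplified sketch, not the actual dandi-cli code; the request payload shape and auth header format are assumptions, only the endpoint path is taken from the logs later in this thread):

```python
import requests

API_URL = "https://api.dandiarchive.org/api"
BATCH_SIZE = 100  # entries deleted per request

def delete_zarr_entries(zarr_id: str, paths: list[str], token: str) -> None:
    """Delete remote Zarr entries in batches of BATCH_SIZE, one request at a time."""
    session = requests.Session()
    session.headers["Authorization"] = f"token {token}"  # header format assumed
    for i in range(0, len(paths), BATCH_SIZE):
        batch = paths[i : i + BATCH_SIZE]
        # Assumed payload shape: a JSON list of {"path": ...} objects
        r = session.delete(
            f"{API_URL}/zarr/{zarr_id}/files/",
            json=[{"path": p} for p in batch],
        )
        r.raise_for_status()
```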

yarikoptic commented 8 months ago

why? doesn't it handle requests in parallel?

jwodder commented 8 months ago

@yarikoptic I can't find any record of why the exact value of 100 was chosen, but I believe the point of the limit is to avoid making the server do too much work on a Zarr at once. Issuing batches simultaneously would therefore defeat the purpose of that limit.

@jjnesbitt @mvandenburgh et alii: Can you confirm or deny that there's no efficiency gain to be had from parallelizing batched Zarr entry deletion requests?

jjnesbitt commented 8 months ago

> @jjnesbitt @mvandenburgh et alii: Can you confirm or deny that there's no efficiency gain to be had from parallelizing batched Zarr entry deletion requests?

That's correct, there would be no efficiency gain. This is for two reasons:

  1. The zarr is locked for update while a request is being processed, so if two requests were made in parallel, they would still be processed sequentially (see the toy sketch below).
  2. I believe the underlying reason we limit the request size is that the API makes requests to the S3 API under the hood in order to delete the requested zarr's files. Allowing parallel invocation of this could cause transient errors from S3, which would complicate things considerably.
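
As a toy illustration of point 1 (this is not the dandi-archive implementation, just a sketch of why a per-Zarr lock makes "parallel" delete requests no faster than sequential ones):

```python
import threading
import time

# Toy model: each Zarr has a lock that a delete request holds while it runs,
# so concurrent requests against the same Zarr are processed one after another.
zarr_lock = threading.Lock()

def handle_delete_batch(seconds_per_batch: float = 1.0) -> None:
    with zarr_lock:                    # held for the whole request
        time.sleep(seconds_per_batch)  # stand-in for the per-batch S3 work

start = time.monotonic()
threads = [threading.Thread(target=handle_delete_batch) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Three "parallel" requests still take ~3 s total, the same as running them
# sequentially, so the client gains nothing by parallelizing.
print(f"elapsed: {time.monotonic() - start:.1f} s")
```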

Is the current performance of the zarr deletion endpoint causing problems elsewhere?

yarikoptic commented 8 months ago

> Is the current performance of the zarr deletion endpoint causing problems elsewhere?

Somewhat. From the log of the issue referenced in the OP, https://github.com/dandi/dandi-cli/issues/1410:

❯ zgrep -e 'Deleting.*files' -e 'DELETE.*zarr' 20240223192140Z-306046.log.gz | head -n 2
2024-02-23T14:36:59-0500 [DEBUG   ] dandi 306046:140253474395840 sub-randomzarrlike/sub-randomzarrlike_junk.zarr: Deleting 226053 files in remote Zarr not present locally
2024-02-23T14:36:59-0500 [DEBUG   ] dandi 306046:140253474395840 DELETE https://api.dandiarchive.org/api/zarr/fd6ab3ea-cff6-4006-a9bf-acfa5d983985/files/
❯ zgrep 'DELETE.*zarr.*files/$' 20240223192140Z-306046.log.gz | tail -n 1
2024-02-23T18:14:40-0500 [DEBUG   ] dandi 306046:140253474395840 DELETE https://api.dandiarchive.org/api/zarr/fd6ab3ea-cff6-4006-a9bf-acfa5d983985/files/

So I believe it took over 3 hours "merely" to delete (lots of) files in the Zarr after the upload of the other files had finished. Here such a drastic action was needed because I changed the "chunking" strategy for the zarr, so this would not be a completely uncommon situation. So I thought it might be nice to make it speedier.
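
For a rough sense of scale, working from the timestamps above and the 100-entries-per-request batching mentioned earlier (a back-of-the-envelope estimate, not a measurement of per-request latency):

```python
from datetime import datetime

# First and last DELETE timestamps from the log excerpt above
start = datetime.fromisoformat("2024-02-23T14:36:59-05:00")
end = datetime.fromisoformat("2024-02-23T18:14:40-05:00")
elapsed = (end - start).total_seconds()      # ~13061 s (~3 h 38 m)

n_files = 226053
batch_size = 100                             # per the earlier comment
n_requests = -(-n_files // batch_size)       # ceil division -> 2261 requests

print(f"{elapsed:.0f} s total, ~{elapsed / n_requests:.1f} s per 100-entry batch")
# -> roughly 13061 s total, ~5.8 s per batched DELETE request
```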