adobe / da-live

Dark Alley is a research project
https://da.live
Apache License 2.0

Improve rename reliability and performance #165

Open auniverseaway opened 1 month ago

auniverseaway commented 1 month ago

As an author, I would like to have a reliable way to rename content so I can easily manage my content.

Additional context

DA's storage is backed by an S3-compatible object storage service and uses keys / prefixes to present as a filesystem. This is high performance and high availability, but it comes with one very large downside: any change to a parent requires all children to also be changed. If you have a parent folder with 1000 items, 1001 items need to be moved. The long term answer is probably some kind of index, but we are not ready for that.
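
To make the cost concrete, here is a minimal sequential sketch (not da-admin's actual code) of what a folder rename looks like against an S3-compatible store, using the AWS SDK's S3 client. Every child key embeds the parent prefix, so each one has to be copied to a new key and deleted from the old one:

```js
import {
  S3Client,
  ListObjectsV2Command,
  CopyObjectCommand,
  DeleteObjectCommand,
} from '@aws-sdk/client-s3';

const s3 = new S3Client({});

async function renamePrefix(bucket, oldPrefix, newPrefix) {
  let ContinuationToken;
  do {
    // Each page returns at most 1000 keys, plus a token for the next page.
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: bucket, Prefix: oldPrefix, ContinuationToken,
    }));
    for (const { Key } of page.Contents ?? []) {
      const destKey = newPrefix + Key.slice(oldPrefix.length);
      // "Rename" = copy to the new key, then delete the old one.
      // (Keys with special characters need URL-encoding in CopySource.)
      await s3.send(new CopyObjectCommand({
        Bucket: bucket, CopySource: `${bucket}/${Key}`, Key: destKey,
      }));
      await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key }));
    }
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
}
```

Done sequentially like this, a 1000-item folder is roughly two storage round trips per child, which is where the 10-minute timeout bites.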

The goal with this ticket is to make any "quick win" improvements: first reliability, second speed. In some ways the two are related, because rename is currently unreliable due to speed: if a request takes longer than 10 minutes, it times out.

High level approach

If we break this up on the client, we can have concurrent requests renaming children at the same time. Instead of one request trying to rename 1000 objects, we could split the work across some number of concurrent requests (5?) at 200 objects each.
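
Something like this on the client (the `/api/rename` endpoint and payload shape are hypothetical, not da-live's actual API):

```js
const CHUNK_SIZE = 200;

async function renameOnClient(children, from, to) {
  // Split the child keys into chunks of 200.
  const chunks = [];
  for (let i = 0; i < children.length; i += CHUNK_SIZE) {
    chunks.push(children.slice(i, i + CHUNK_SIZE));
  }
  // 1000 children -> 5 concurrent requests of 200 objects each.
  const results = await Promise.allSettled(chunks.map((chunk) => fetch('/api/rename', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ from, to, keys: chunk }),
  })));
  // Surface any chunk that failed so the user can retry just that part.
  return results.filter((r) => r.status === 'rejected' || !r.value.ok);
}
```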

I would prefer that we look into breaking this up on the server, so we can keep a consistently performant API regardless of implementation. That is only feasible if we can return a response in a reasonable amount of time. I think there are clever ways to do this with KV or even Durable Objects: if an operation runs past a certain timeout limit (2 minutes?), we still give a response, but treat the remainder as a "job" and inform the user when it is safe to work with the content.
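
A rough sketch of that job pattern in a Worker (`renameAll` and the `env.JOBS` KV binding are assumptions, not da-admin's actual code):

```js
const TIME_BUDGET_MS = 2 * 60 * 1000; // the 2-minute limit floated above

// Called from the Worker's fetch handler. renameAll is a hypothetical helper
// that performs all the copy / delete operations for the rename.
async function handleRename(request, env, ctx) {
  const jobId = crypto.randomUUID();
  const work = renameAll(request, env);

  // Race the rename work against the time budget.
  const finished = await Promise.race([
    work.then(() => true),
    new Promise((resolve) => setTimeout(() => resolve(false), TIME_BUDGET_MS)),
  ]);
  if (finished) return new Response('renamed', { status: 200 });

  // Over budget: record a job in KV, let the work continue past the
  // response via waitUntil, and tell the client to poll for completion.
  await env.JOBS.put(jobId, JSON.stringify({ state: 'running' }));
  ctx.waitUntil(work
    .then(() => env.JOBS.put(jobId, JSON.stringify({ state: 'done' })))
    .catch(() => env.JOBS.put(jobId, JSON.stringify({ state: 'failed' }))));
  return new Response(JSON.stringify({ jobId }), { status: 202 });
}
```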

I don't want to hop to a job / queue system before we look into doing concurrent copies / pastes / deletes. I'm 90% sure the copy / traversal code I wrote is not optimized from a performance perspective. Because there's no real relationship between these keys, we should be able to copy and delete each key without impacting the others.

Criteria of acceptance

  1. Rename (copy, paste, and delete) is consistently reliable to the end user.
  2. Any additional performance benefits are welcome.
bosschaert commented 1 month ago

We could see if the following makes a difference: https://github.com/adobe/da-admin/commit/dbdd5e37b3ac8718b14a89dadc1e1f0652caca3d

This moves from a single promise that ran all the copy (or delete) operations in sequence to multiple promises, one per operation, fired off all at once.
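
Paraphrased, the shape of the change (`copyObject` here is a stand-in for the real storage call):

```js
// Before: one promise chain, awaiting each storage call in sequence.
for (const key of keys) {
  await copyObject(key);
}

// After: fire all the copies at once and wait for the whole set.
await Promise.all(keys.map((key) => copyObject(key)));
```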

auniverseaway commented 1 month ago

@bosschaert that’s a good start and along the lines of what I’m thinking.

CF has a sub-request concurrency limit of 1000, so we are bumping into that without batching the requests. There's going to be a sweet spot of concurrency, and we probably want to throttle down to ~500-800. If I remember correctly, the list objects API gives us 1000 keys at a time and a continuation token for the next batch.
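
In code, roughly (`listPage` and `copyObject` are hypothetical helpers standing in for the real list / copy calls):

```js
const BATCH = 500; // throttled below the 1000 sub-request cap, per the guess above

async function copyAllBatched() {
  const failures = [];
  let cursor;
  do {
    // Each list page returns up to 1000 keys plus a continuation cursor.
    const { keys, cursor: next } = await listPage(cursor);
    for (let i = 0; i < keys.length; i += BATCH) {
      // allSettled instead of all: one failed copy shouldn't abort the
      // batch, and the failures can be collected for a retry pass.
      const results = await Promise.allSettled(
        keys.slice(i, i + BATCH).map((key) => copyObject(key)),
      );
      results.forEach((r, j) => {
        if (r.status === 'rejected') failures.push(keys[i + j]);
      });
    }
    cursor = next;
  } while (cursor);
  return failures;
}
```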

While I don’t expect this to be a problem, we also have to account for errors during the copy.

The other challenge we have is that we do all the copies in bulk, then all the deletes. If a copy or delete fails partway through, a file can be left in two places, which is even messier. Ideally the rename behavior changes so that if something fails midway, you only have the rest of the tree to finish, and each file lives in one place or the other.

Maybe to illustrate this…

Today:

Copy 1
Copy 2
Copy 3
Delete 1
Delete 2
Delete 3

Tomorrow:

Copy 1
Delete 1
Copy 2
Delete 2
Copy 3
Delete 3
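
A sketch of that "tomorrow" ordering (`copyObject` / `deleteObject` are hypothetical helpers): pairing each delete with its own copy means a mid-run failure leaves every file in exactly one place, either already moved or still untouched.

```js
async function moveKey(key) {
  // Only delete once this key's copy has succeeded.
  await copyObject(key);
  await deleteObject(key);
}

// The pairs can still run concurrently, since the keys are independent.
const results = await Promise.allSettled(keys.map((key) => moveKey(key)));
```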