ORNL / DataFed

A Federated Scientific Data Management System
https://ornl.github.io/DataFed/

System - Operations on large collections take too much time and time out client #420

Closed dvstans closed 4 years ago

dvstans commented 4 years ago

Some operations, such as deleting a collection containing thousands of records, take a very long time to run. This causes the operation to time out on the client side. We need to identify commands that can run long and split each into an initial confirmation part and a background task part to avoid these client timeouts.

dvstans commented 4 years ago

Investigating the deletion of large collections showed that most of the time is spent loading each record to obtain its data size, which is needed to correct the per-allocation statistics. Apparently reading is slower than writing because reads must be synchronous whereas writes need not be.

dvstans commented 4 years ago

The solution to this problem requires refactoring how background tasks are initialized. Currently, the init process does some potentially expensive processing, which can cause client timeouts and block other tasks from running. Instead, the task init stage will simply record what the client has requested in a new task (with no processing or blocking) and immediately return the task ID to the client. These initial tasks will then run in the background, where the init processing (such as permission checks and concurrency analysis) is performed. The task will then either fail or proceed to the next stage, which is either blocked or ready depending on concurrency with other tasks.
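The deferred-init flow described above can be sketched roughly as follows. This is only an illustration in Python, not DataFed's actual task code: `TaskManager`, `TaskState`, `submit`, `process_next`, and the `has_permission` callback are all hypothetical names, and the "concurrency analysis" is reduced to a check against a set of item IDs held by other tasks.

```python
import itertools
import queue
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    NEW = "new"          # request recorded, nothing checked yet
    BLOCKED = "blocked"  # waiting on items held by another task
    READY = "ready"      # free to run
    FAILED = "failed"    # rejected during deferred init


@dataclass
class Task:
    task_id: int
    client: str
    request: dict
    state: TaskState = TaskState.NEW
    error: str = ""


class TaskManager:
    """Hypothetical manager: cheap submit, expensive checks deferred."""

    def __init__(self, has_permission, locked_items):
        self._ids = itertools.count(1)
        self._pending = queue.Queue()
        self.tasks = {}
        self._has_permission = has_permission  # assumed callback(client, item) -> bool
        self._locked = locked_items            # assumed set of item IDs held by other tasks

    def submit(self, client, request):
        # Init stage: only record the request and hand back an ID.
        task = Task(next(self._ids), client, request)
        self.tasks[task.task_id] = task
        self._pending.put(task)
        return task.task_id

    def process_next(self):
        # Deferred init, meant to be called from a background worker.
        task = self._pending.get()
        items = task.request["items"]
        if not all(self._has_permission(task.client, i) for i in items):
            task.state = TaskState.FAILED
            task.error = "permission denied"
        elif any(i in self._locked for i in items):
            task.state = TaskState.BLOCKED
        else:
            task.state = TaskState.READY
        return task
```

With this shape the client gets its task ID back from `submit` without waiting on any checks, and would poll the task state afterwards to learn whether it failed, blocked, or became ready.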

dvstans commented 4 years ago

Update: There was no need (yet) to refactor the task code, since all expensive operations can already be placed in the "run" function rather than the "init" function. However, for deletions, this leaves a window in which the items can still be accessed between init and completion of the task. Deletes were intentionally placed in the init function with exclusive locks to ensure the operation was atomic. A possible workaround is to isolate all items somehow, such as by removing ACLs and owner/creator fields, or by marking them in some way to prevent access. However, this may also be expensive to do in the init function.
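The "marking" workaround mentioned above might look something like the sketch below. It is a toy in-memory model under stated assumptions, not DataFed's database layer: `RecordStore`, `mark_for_deletion`, `read`, and `purge` are hypothetical names, and each record is reduced to its data size. Note that marking still touches every item, so it would carry the init-time cost the comment warns about.

```python
class RecordStore:
    """Hypothetical in-memory store mapping record ID -> data size."""

    def __init__(self, records):
        self._records = dict(records)
        self._pending_delete = set()

    def mark_for_deletion(self, ids):
        # Init step: flag items so that later reads are refused.
        self._pending_delete.update(i for i in ids if i in self._records)

    def read(self, rec_id):
        # Marked items are treated as inaccessible until they are purged.
        if rec_id in self._pending_delete:
            raise PermissionError(f"{rec_id} is pending deletion")
        return self._records[rec_id]

    def purge(self):
        # Run step: the expensive part, done in the background.
        # Returns the total size freed, for allocation statistics.
        freed = 0
        for rec_id in sorted(self._pending_delete):
            freed += self._records.pop(rec_id)
        self._pending_delete.clear()
        return freed
```

Here the flag closes the access window between init and run, while loading sizes and removing the records stays in the background step.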

dvstans commented 4 years ago

The current implementation works as is, but there is a small window in which actions by other users can interfere and cause the task to fail. Because this is improbable and low impact (the user can simply retry the operation), I will close this issue.