ORNL / DataFed

A Federated Scientific Data Management System
https://ornl.github.io/DataFed/

System - Task stuck in pending state #523

Open dvstans opened 3 years ago

dvstans commented 3 years ago

While testing large, concurrent create/delete operations, a delete task became stuck in the pending state. Restarting the core service caused the task to run and complete correctly. In addition, a single collection was not deleted, possibly because of the data records it contained.

JoshuaSBrown commented 1 year ago

I don't understand the exact problem. From the description, it seems like there is more than one:

  1. Reliably and consistently handling a large number of concurrent requests
  2. Handling a request that is partially complete
  3. Resolving a partially complete request automatically

I don't know what this means or how it pertains to the problem:

"In addition, a single collection was not deleted"

Are collections not being deleted correctly?

dvstans commented 1 year ago

When a collection is deleted, all contained collections should be deleted too, but in this case one somehow survived, which is a bug. The delete task should not get stuck; I think this was due to very heavy load on the DB. This was basically a stress-test scenario, and the system did not handle it well. I don't think any action can be taken on this issue until we have a way to reproduce it with controlled stress testing.

JoshuaSBrown commented 1 year ago

I see, so the prereq for this issue is to develop a stress test suite and environment.
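
As a starting point for discussion, here is a minimal sketch of what such a stress test could look like. It is not DataFed code: `InMemoryStore`, `build_tree`, and `run_stress` are hypothetical names, and the in-memory store only stands in for the real service so the harness can run on its own. The idea is to create nested collections concurrently, delete the roots concurrently, and then check that no collection survived.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryStore:
    """Hypothetical stand-in for the real service; NOT the DataFed API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._children = {}  # collection id -> list of child collection ids
        self._next_id = 0

    def create(self, parent=None):
        with self._lock:
            cid = self._next_id
            self._next_id += 1
            self._children[cid] = []
            if parent is not None:
                self._children[parent].append(cid)
            return cid

    def delete(self, cid):
        # Recursive delete: remove the collection and everything it contains.
        with self._lock:
            stack = [cid]
            while stack:
                cur = stack.pop()
                stack.extend(self._children.pop(cur, []))

    def remaining(self):
        with self._lock:
            return sorted(self._children)


def build_tree(store, depth, fanout):
    """Create one root collection with nested children; return the root id."""
    root = store.create()
    level = [root]
    for _ in range(depth):
        level = [store.create(parent=p) for p in level for _ in range(fanout)]
    return root


def run_stress(store, trees=20, depth=3, fanout=3, workers=8):
    """Create trees concurrently, delete the roots concurrently, return survivors."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        roots = list(pool.map(lambda _: build_tree(store, depth, fanout), range(trees)))
        list(pool.map(store.delete, roots))
    return store.remaining()


survivors = run_stress(InMemoryStore())
print("surviving collections:", survivors)
```

Against the real system, the store would be replaced by a client talking to a test deployment, and a non-empty survivor list would be the signal that the "one collection survived" bug has been reproduced.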

dvstans commented 1 year ago

FYI, this issue may be related to the recent issue with the Lehigh repository. I suspect there is an edge case with task scheduling that is causing valid ready tasks to never run.
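
One way to surface this kind of edge case would be a periodic check for tasks that are runnable but have not started. The sketch below is only an illustration under assumed names: the `Task` fields, the state strings, and `find_stalled` are hypothetical and do not reflect the actual DataFed task schema or scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative only.
    task_id: str
    state: str            # "pending", "running", or "done"
    queued_at: float      # seconds since some reference time
    blocked_by: list = field(default_factory=list)


def find_stalled(tasks, now, max_wait):
    """Return ids of pending tasks with no unfinished blockers that have waited too long."""
    by_id = {t.task_id: t for t in tasks}
    stalled = []
    for t in tasks:
        if t.state != "pending":
            continue
        # A blocker counts as unfinished only if it still exists and is not done.
        blocked = any(
            b in by_id and by_id[b].state != "done" for b in t.blocked_by
        )
        if not blocked and now - t.queued_at > max_wait:
            stalled.append(t.task_id)
    return stalled


tasks = [
    Task("t1", "done", queued_at=0.0),
    Task("t2", "pending", queued_at=10.0, blocked_by=["t1"]),  # runnable, waiting a long time
    Task("t3", "running", queued_at=20.0),
    Task("t4", "pending", queued_at=30.0, blocked_by=["t3"]),  # legitimately waiting on t3
    Task("t5", "pending", queued_at=995.0),                    # runnable, but only just queued
]
print(find_stalled(tasks, now=1000.0, max_wait=60.0))  # ['t2']
```

A check like this would not fix the scheduling bug, but logging its result during a stress run could help pin down when a ready task stops being picked up.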