Open xumengpanda opened 4 years ago
Can you give more details about the priority inversion? Like roughly how the priority works and how an inversion happens?
I assume the question is more about the definition of priority, since the issue description gave an example of the priority inversion.
The priority is Flow's task priority. Each actor is assigned a priority when it waits, and each endpoint is also assigned a priority. The priorities are defined in TaskPriority in the code. For FR, they are defined as:
RestoreApplierWriteDB = 2310,
RestoreApplierReceiveMutations = 2300,
RestoreLoaderFinishVersionBatch = 2220,
RestoreLoaderSendMutations = 2210,
RestoreLoaderLoadFiles = 2200,
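To make the ordering concrete, here is a minimal sketch of how these values rank ready work. It assumes that a larger numeric priority runs first; `ReadyTask`, `LowerPriority`, and `runOrder` are hypothetical names for illustration and are not part of Flow.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Values copied from the list above.
enum class TaskPriority : int {
    RestoreApplierWriteDB = 2310,
    RestoreApplierReceiveMutations = 2300,
    RestoreLoaderFinishVersionBatch = 2220,
    RestoreLoaderSendMutations = 2210,
    RestoreLoaderLoadFiles = 2200,
};

// Hypothetical stand-in for an actor that is ready to run.
struct ReadyTask {
    TaskPriority priority;
    std::string name;
};

// Assumption: a larger numeric priority is scheduled first.
struct LowerPriority {
    bool operator()(const ReadyTask& a, const ReadyTask& b) const {
        return static_cast<int>(a.priority) < static_cast<int>(b.priority);
    }
};

// Returns the task names in the order a priority-driven loop would run them.
std::vector<std::string> runOrder(const std::vector<ReadyTask>& ready) {
    std::priority_queue<ReadyTask, std::vector<ReadyTask>, LowerPriority> q(ready.begin(), ready.end());
    std::vector<std::string> order;
    while (!q.empty()) {
        order.push_back(q.top().name);
        q.pop();
    }
    assert(order.size() == ready.size()); // every ready task is dispatched once
    return order;
}
```

Under that assumption, applier work (2300 and above) always runs ahead of loader work (2200 to 2220) whenever both are ready on the same process.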
Note: priority inversion does not necessarily cause deadlock.
When fast restore (FR) pipeline-processes multiple version batches, loaders can process work for a future version batch even when there is still work pending for the current in-progress version batch.
For example, suppose the current in-progress (minimum) version batch index is 4, and FR is asking loaders to send mutations to appliers for version batch 4. That mutation-sending work can be delayed by the work for version batches 5 - 7, which asks loaders to parse backup files and send mutations.
The interference may waste resources by leaving the FDB cluster idle. The interference exists because FR does not differentiate the priorities of the same type of actors across version batches. For example, the actors that parse backup files on loaders have the same priority for all version batches.
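The effect can be reproduced with a small simulation. This is only a sketch: it assumes that work with equal priority is dispatched in arrival order, and `PendingWork` and `dispatchOrder` are hypothetical names, not FR code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record of one piece of loader work.
struct PendingWork {
    int priority;      // task priority, e.g. 2210 for RestoreLoaderSendMutations
    int versionBatch;  // index of the version batch the work belongs to
};

// Assumption: higher priority runs first, and work with equal priority
// runs in arrival order. The version batch index plays no role.
std::vector<int> dispatchOrder(std::vector<PendingWork> arrived) {
    std::stable_sort(arrived.begin(), arrived.end(),
                     [](const PendingWork& a, const PendingWork& b) {
                         return a.priority > b.priority;
                     });
    std::vector<int> batches;
    for (const PendingWork& w : arrived) {
        assert(w.versionBatch >= 0);
        batches.push_back(w.versionBatch);
    }
    return batches;
}
```

If send-mutation requests (priority 2210) arrive for version batches 5, 6, 7 and then 4, the dispatch order is 5, 6, 7, 4: the in-progress batch 4 waits behind three future batches because nothing in the priority distinguishes them.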
Challenge: A version batch's actor priority must change when FR finishes processing a version batch. For example, version batch (VB) 7 may have the lowest priority when FR is processing VB 4. But when FR finishes processing VB 6, VB 7 will have the highest priority. We need a way to dynamically assign priority to actors.
Possible solution: Each restore role knows the largest finished version batch index. When a restore actor starts, we assign its priority based on the finished version batch index. Whenever the actor is unblocked and runs, we re-calculate the priority the actor should have and compare it with the priority it was assigned at start. If the two do not match, we re-assign the new priority to the actor and yield.
This ensures that (1) actors in future version batches do not block actors in the current version batch, and (2) nodes are not left idle when they have pending work to do.
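A minimal sketch of the re-calculation step follows. The numbering scheme is an assumption made for illustration: the FR priorities listed earlier span 2200 to 2310, so a hypothetical stride of 200 per version batch keeps every actor of a nearer batch above every actor of a farther one. `kBatchStride`, `kMaxDistance`, `effectivePriority`, `RestoreActorState`, and `refreshPriority` are invented names, and a real change would have to check that the resulting values do not collide with other Flow priorities.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical constants: the stride exceeds the 110-wide spread of the
// FR priorities, and the cap keeps the result positive.
constexpr int kBatchStride = 200;
constexpr int kMaxDistance = 5;

// Priority an actor should have, given the largest finished version batch.
// The in-progress batch (finished + 1) keeps its base priority.
int effectivePriority(int basePriority, int versionBatch, int finishedVersionBatch) {
    assert(versionBatch > finishedVersionBatch);
    int distance = versionBatch - (finishedVersionBatch + 1);
    return basePriority - kBatchStride * std::min(distance, kMaxDistance);
}

// Hypothetical per-actor bookkeeping.
struct RestoreActorState {
    int basePriority;      // e.g. 2210 for RestoreLoaderSendMutations
    int versionBatch;      // version batch the actor works on
    int assignedPriority;  // priority assigned when the actor last started or yielded
};

// Called when the actor is unblocked. Returns true if the priority changed,
// in which case the caller should yield so the actor is rescheduled at the
// new priority.
bool refreshPriority(RestoreActorState& actor, int finishedVersionBatch) {
    int current = effectivePriority(actor.basePriority, actor.versionBatch, finishedVersionBatch);
    if (current == actor.assignedPriority) {
        return false;
    }
    actor.assignedPriority = current;
    return true;
}
```

With VB 3 finished, a send-mutations actor for VB 7 gets 2210 - 3 * 200 = 1610, below the 2200 of a load-files actor for VB 4. Once VB 6 is finished, `refreshPriority` raises the VB 7 actor to 2210 and reports that it should yield.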
Another solution: Evan suggested we may also use a priority queue to queue the requests and have our own logic to dispatch these requests based on version batch number. The reference code is https://github.com/apple/foundationdb/blob/release-6.1/fdbserver/MasterProxyServer.actor.cpp#L131
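The queue-based idea could look roughly like the sketch below. It is not a reproduction of the linked code: `RestoreRequest`, `RunsAfter`, and `RequestDispatcher` are hypothetical names, and the policy (smallest version batch first, arrival order within a batch) is one assumed choice of dispatch logic.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical queued request.
struct RestoreRequest {
    int versionBatch;         // version batch index the request belongs to
    long sequence;            // arrival order, used as a tie-breaker
    std::string description;  // placeholder for the real request payload
};

// Orders the queue so the smallest version batch is on top, and within a
// batch the earliest arrival is on top.
struct RunsAfter {
    bool operator()(const RestoreRequest& a, const RestoreRequest& b) const {
        if (a.versionBatch != b.versionBatch) {
            return a.versionBatch > b.versionBatch;
        }
        return a.sequence > b.sequence;
    }
};

class RequestDispatcher {
public:
    void enqueue(int versionBatch, std::string description) {
        assert(versionBatch >= 0);
        queue_.push(RestoreRequest{versionBatch, nextSequence_++, std::move(description)});
    }

    bool empty() const { return queue_.empty(); }

    // Removes and returns the request that should be handled next.
    RestoreRequest next() {
        assert(!queue_.empty());
        RestoreRequest request = queue_.top();
        queue_.pop();
        return request;
    }

private:
    std::priority_queue<RestoreRequest, std::vector<RestoreRequest>, RunsAfter> queue_;
    long nextSequence_ = 0;
};
```

Because the ordering is recomputed on every `next()` call, no actor priority has to be reassigned when a version batch finishes: requests for the oldest unfinished batch simply surface first.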
Update: Based on offline discussion with @dongxinEric , I removed the priority inversion in the issue because it didn't correctly describe the issue.