XRPLF / rippled

Decentralized cryptocurrency blockchain daemon implementing the XRP Ledger protocol in C++
https://xrpl.org
ISC License

A code path can cause deadlock (Version: rippled 1.6) #4023

Open luleigreat opened 2 years ago

luleigreat commented 2 years ago

Hi all, I have discovered code that can cause a deadlock:

void
BatchWriter::store(std::shared_ptr<NodeObject> const& object)
{
    std::unique_lock<decltype(mWriteMutex)> sl(mWriteMutex);

    // If the batch has reached its limit, we wait
    // until the batch writer is finished
    while (mWriteSet.size() >= batchWriteLimitSize)
        mWriteCondition.wait(sl);

    mWriteSet.push_back(object);

    if (!mWritePending)
    {
        mWritePending = true;

        m_scheduler.scheduleTask(*this);
    }
}

void
NodeStoreScheduler::scheduleTask(NodeStore::Task& task)
{
    if (jobQueue_.isStopped())
        return;

    if (!jobQueue_.addJob(jtWRITE, "NodeObject::store", [&task](Job&) {
            task.performScheduledTask();
        }))
    {
        // Job not added, presumably because we're shutting down.
        // Recover by executing the task synchronously.
        task.performScheduledTask();
    }
}

If the mWriteSet.size() >= batchWriteLimitSize condition is met, every job in the job queue may end up waiting on this condition variable. The last successful call to m_scheduler.scheduleTask only added a job to the queue; it cannot guarantee the job will execute immediately. If all worker threads are blocked (which is the scenario I encountered: every job was either waiting for the InboundLedger::update lock or waiting on the mWriteCondition condition variable), then performScheduledTask can never run, and the process deadlocks!

I am working on rippled 1.6, and I have reviewed the related code on rippled:develop; the problem still appears to be there.

Has this problem been resolved? I think the condition variable usage (mWriteCondition) could be removed, but is there a better resolution?

HowardHinnant commented 2 years ago

Can you reliably reproduce this deadlocked state? If so, can you include directions for doing so? Thanks.

luleigreat commented 2 years ago

[workers] is configured to 9, and there were exactly 9 jobs running when the deadlock occurred. As for directions to reproduce: the server has 16 CPU cores at 3.0 GHz, with workers configured to only 9. I think the deadlock occurred because the CPU is far faster than the disk I/O, so many nodes could not be written to disk immediately.
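For context, the worker-thread count described above corresponds to a rippled.cfg stanza like the following (the value 9 is taken from this report):

```ini
# rippled.cfg excerpt: number of worker threads in the job queue
[workers]
9
```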