Closed wkozaczuk closed 9 months ago
Could it possibly be related to #1024 and/or #951?
Also, I think it should not be difficult to create unit tests to validate all this provided we understand how exactly this linked list in waitqueue
should work. The unit tests would not even need to do any thread synchronization logic, just create an empty waitqueue
and wait_object
s and only exercise arm()
and disarm()
(the woken()
should return false = just keep non-null t
in wait_record
s) and assert on waitqueue
s newest
and oldest
and possibly walk next
just to validate things work correctly.
Wow, good catch, I think you are right, the loop in disarm() indeed seems wrong when _wr
is last (newest) link in the chain. I haven't looked at this code in many years (@avikivity wrote it 9 years ago), so maybe I'm missing something. When you have a final version of your proposed patch, I'll try to review it.
Please note that in the beginning, pnext points to fifo.oldest, so "pnext = _wr.next" sets fifo.oldest, so your code `fifo.oldest = pnext;` should be redundant, I don't know if it matters or not (as I said I didn't fully review your patch).
Do you have a theory how this bug might have caused https://github.com/cloudius-systems/osv/issues/1024 or https://github.com/cloudius-systems/osv/issues/951? Maybe it can somehow allow the same object to be woken twice?
I think you are right - this part is redundant:
if (!older) {
fifo.oldest = *pnext;
}
No, the loop is wrong when _wr
is the oldest in the chain and there are 2 wait records in the waitqueue.
I could not figure out what the intention was behind the 2nd part of this statement if (!*pnext || !(*pnext)->next)
? Maybe @avikivity still remembers after these 9 years :-). It looks like dropping the 2nd part and making it if (!*pnext)
like in my patch fixes this bug.
Anyway, I will try to write those unit tests to validate and prove my theory.
I think you are right - this part is redundant:
if (!older) { fifo.oldest = *pnext; }
No, the loop is wrong when
_wr
is the oldest in the chain and there are 2 wait records in the waitqueue.I could not figure out what the intention was behind the 2nd part of this statement
if (!*pnext || !(*pnext)->next)
? Maybe @avikivity still remembers after these 9 years :-). It looks like dropping the 2nd part and making itif (!*pnext)
like in my patch fixes this bug.
I absolutely don't remember, and you shouldn't trust the memory of someone who can't code a singly linked list either.
When testing the java http server when running on OSv with Linux dynamic linker, I occasionally (1 out of 3 times) would see this crash in the
sched::wait_object<waitqueue>::arm()
:Initially, I thought the
arm()
method is buggy:and changing it to:
would fix it but instead it led to another crash in
disarm()
this time:see
disarm()
below:Then I realized that the original crash showed this state of
fifo
:and this is wrong - if
newest
is 0 then oldest should also be0
if I understand how this linked list should work. So something corrupts it. What?Here is my theory: the 2 other places that modify
fifo
-void waitqueue::wake_one(mutex& mtx)
andvoid waitqueue::wake_all(mutex& mtx)
seem to work fine. But thedisarm()
is buggy I think.Imagine this scenario: the waitqueue has two
wait_record
s andoldest
points to the older one andnewest
to the newer one and the older onenext
points to the newer one. Then thedisarm()
is called with_wr
equal to thefifo.oldest
so the loop ends up doing this:oldest
is equal to_wr
:oldest
is assigned_wr.next
which I believe is whatfifo.newest
points toif (!*pnext || !(*pnext)->next)
is false but the 2nd is true sofifo.newest
is assignednewest
which at this point (1st iteration) isnullptr
fifo
is left: older not null, newest is null (just like the original crash exposed).The correct version should look like this I think:
After applying this patch I could not see these crashes. But who knows I may be missing something else and/or my reasoning is wrong here.