Closed jespiron closed 2 months ago
Hello there,
First off thank you so much for your report!!! I really appreciate it.
I think another possible fix for this issue could be forcing a reschedule on the CPU that is running the thread. What are your thoughts on this solution?
Hey!! I appreciate your initiative in this project, it takes a lot of dedication to get it to this point 😄
As for forcing a reschedule: do you mean we force a reschedule on T0 before destroying it? This indeed prevents T0 from running again, since we remove T0 from the list, so the scheduler can't select it. We won't access freed memory, which is good. But in case T0 is running something like `close`, we do want T0 to run to completion, to clean up the fd we created in `open` and prevent a resource leak.
Basically, when T0 is in a syscall that's supposed to clean up something (freeing resources, unlocking something, etc.), how can we make sure it finishes its job?
Yeah, your suggestion makes a lot of sense with this in mind as well. But what if a thread holds a lock on a resource? Then we need some way to figure out whether a thread is holding a lock and release it when the thread gets killed.
For sure! You can search "asynchronous thread cancellation" vs "deferred thread cancellation" for more. It sounds like I'm leaning towards deferred cancellation, but it's not the only way, and there are definitely many directions you can take this.
I'm thinking of going async, since I think it would help us find any locking issues: if a thread dies without unlocking, then there must be something wrong. I'd like to hear your opinion, though, since you seem to have more experience in this field, and I'd like to learn more.
Hey there! I've committed the fix in 7fe8bfb59e50477849ce8aaf4c8adb30536bd527. I hope this resolves the issue!
Hello! Haha funny variable names :^)
The fix for this is more complicated than originally described. I believe the same issue is still possible:
At a high level, yes, we'd want to defer until a safe point. I named reschedule as one example, though in multiprocessor environments there's more to take into account. These safe points depend on how the rest of the kernel is implemented:
If T1 was in the middle of a syscall, a safe point can be before it `IRET`s back into user mode (depending on how the rest of the kernel is implemented, this might also not be the full story).
Whatever fix you come up with, if you can run through each scenario of T0 and T1 and convince yourself that T1 finishes safely, then it's a good enough fix.
Hi! Sorry for replying late, I had exams going on. So I should probably kill the thread when it returns back to userspace, since by then it will have cleaned up anything it was doing in kernel space.
**Describe the bug**
Suppose CPU core 0 is running Thread 0 and CPU core 1 is running Thread 1. T0 is doing an `openat` syscall, relevant line is line 508. T1 is running `thread_kill` on T0 (i.e. `thrd == T0`), relevant line is the call to `kfree`.

Possible races:

- `running_thread->mother_proc` (since `running_thread` will be freed!)

`thread_kill` destroying the TCB doesn't mean that the other thread will stop executing immediately! In the above case, CPU 0 will continue executing the `openat` syscall made by T0 even after T0 is destroyed.

To my understanding, the above scenario is possible because the kernel supports multiprocessing, as I see support for SMP. Even in a uniprocessor environment, the threads could interleave to create the above scenario (looking at the syscall stubs, interrupts are enabled when syscalls are made).
**Suggestion**
Instead of cleaning up the thread immediately upon `thread_kill`, set some sort of flag indicating that the thread should be killed. Then clean up the thread when it's safe to do so, likely on the next reschedule.