Open jcjgraf opened 3 days ago
Analysis (take with grain of salt, am not experienced with kernels)
The test calls vmap_kern_4k
which takes vmap_lock
lock (in vmap_4k
). The paging code sometimes requires another call to vmap_kern_4k
(when running out of frame array entries) through get_free_frame
which is called in get_pgentry_mfn
. This appears to cause the deadlock observed here.
If ktf runs in parallel on multiple cores there is another potential deadlock scenario:
One core calls vmap_kern_4k
while another core calls get_free_frame
. In the unlikely event that we also run out of array frames at the same time we end up with one core having the paging lock and wanting the frame lock while the second core is in the opposite situation.
Potential Solutions?
I can make a pull request for either solution. Since I have no experience in kernel development I'd appreciate some feedback on whether these are viable solutions before making the pull requests.
The analysis reasoning is correct I think. The PMM implementation tries to take care of this already (see the usage of MIN_FREE_FRAMES_THRESHOLD
and MAX_FREE_FRAMES_THRESHOLD
at https://github.com/KernelTestFramework/ktf/blob/e1658d80f4397def5fa10d5f46e3446f10700369/mm/pmm.c#L37 ). Apparently, it fails to do so correctly.
Hence, the second approach of making sure there is always available frame for the frames array is the way to go.
The reservation looks good to me. The problem seems to be with mapping the newly acquired frame to use it as a frames array. This mapping might require another call to get_free_frame
. There are two issues here:
get_free_frame
might have been called by paging in the first place so we cannot just do another call to paging due to the lock.get_free_frame
needs to realize it is in array acquisition stage and give out reserved frames without trying to refill again. However, to know this it needs to take the lock which is already taken.Intuitively I'd try to solve both these issues using re-entrant locks but they don't exist yet in ktf and bring their own load of complexity...
Alright, new suggestion :D
Paging wants to avoid pmm 4k frame refill (and array allocation) so it gets its own get_free_frame_norefill
method to avoid this case. Further, there is a way for pmm to tell the paging that the paging request is because refill is currently running and paging should use get_free_frame_norefill_nolock instead to avoid a deadlock (essentially a cheap reentry). This should ensure consistency and avoid these pesky deadlocks... We still need to make sure that the refill does occur after or before paging.
Describe the bug There is a bug in the PMM where it locks-up (deadlock?). It happens when the allocator runs out of 4k frames and splits a higher-order frame. Maybe no 4k page was reserved for the new signling page?
To Reproduce
test_kernel_task_func2
demonstrates the issue (#341 )KT2_ROUNDS
to 1 andKT2_CHUNKS
to 1000 to allocate enough 4k frames such that they get depletedsrand
seed to one where you repeatedly observe the deadlockExpected behavior The testcase just passes. Throwing more memory at the issue, by increasing
.qemu_config:QEMU_RAM
, has seemingly no impact.Props to Sandro Rüegge (@sparchatus) for helping me debugging this.