[BUG] PMM Deadlock - Githubissues

jcjgraf commented 3 days ago

Describe the bug There is a bug in the PMM where it locks-up (deadlock?). It happens when the allocator runs out of 4k frames and splits a higher-order frame. Maybe no 4k page was reserved for the new signling page?

To Reproduce

The testcase test_kernel_task_func2 demonstrates the issue (#341 )
Set KT2_ROUNDS to 1 and KT2_CHUNKS to 1000 to allocate enough 4k frames such that they get depleted
The testcase will either result in an assertion in the pmm getting triggered or by locking-up
(optional) Fix the srand seed to one where you repeatedly observe the deadlock

Expected behavior The testcase just passes. Throwing more memory at the issue, by increasing .qemu_config:QEMU_RAM, has seemingly no impact.

Props to Sandro Rüegge (@sparchatus) for helping me debugging this.

sparchatus commented 19 hours ago

Analysis (take with grain of salt, am not experienced with kernels) The test calls vmap_kern_4k which takes vmap_lock lock (in vmap_4k). The paging code sometimes requires another call to vmap_kern_4k (when running out of frame array entries) through get_free_frame which is called in get_pgentry_mfn. This appears to cause the deadlock observed here.

If ktf runs in parallel on multiple cores there is another potential deadlock scenario: One core calls vmap_kern_4k while another core calls get_free_frame. In the unlikely event that we also run out of array frames at the same time we end up with one core having the paging lock and wanting the frame lock while the second core is in the opposite situation.

Potential Solutions?

Release the paging lock when calling get_free_frame. This in turn might necessitate re-walking the page table. (same for get_free_frame which would need to free the pmm lock before mapping more frame array space)
Another approach could be to allocate the potentially required frames before taking the paging lock (or even having a set of "ready" frames in the paging which we can refill afterwards)

I can make a pull request for either solution. Since I have no experience in kernel development I'd appreciate some feedback on whether these are viable solutions before making the pull requests.

wipawel commented 18 hours ago

The analysis reasoning is correct I think. The PMM implementation tries to take care of this already (see the usage of MIN_FREE_FRAMES_THRESHOLD and MAX_FREE_FRAMES_THRESHOLD at https://github.com/KernelTestFramework/ktf/blob/e1658d80f4397def5fa10d5f46e3446f10700369/mm/pmm.c#L37 ). Apparently, it fails to do so correctly.

Hence, the second approach of making sure there is always available frame for the frames array is the way to go.

sparchatus commented 16 hours ago

The reservation looks good to me. The problem seems to be with mapping the newly acquired frame to use it as a frames array. This mapping might require another call to get_free_frame. There are two issues here:

get_free_frame might have been called by paging in the first place so we cannot just do another call to paging due to the lock.
The get_free_frame needs to realize it is in array acquisition stage and give out reserved frames without trying to refill again. However, to know this it needs to take the lock which is already taken.

Intuitively I'd try to solve both these issues using re-entrant locks but they don't exist yet in ktf and bring their own load of complexity...

sparchatus commented 13 hours ago

Alright, new suggestion :D

Paging wants to avoid pmm 4k frame refill (and array allocation) so it gets its own get_free_frame_norefill method to avoid this case. Further, there is a way for pmm to tell the paging that the paging request is because refill is currently running and paging should use get_free_frame_norefill_nolock instead to avoid a deadlock (essentially a cheap reentry). This should ensure consistency and avoid these pesky deadlocks... We still need to make sure that the refill does occur after or before paging.

KernelTestFramework / ktf

[BUG] PMM Deadlock #343