Race condition in heap allocator

sdww0 commented 1 month ago

Describe the bug

The program was stuck when I tried to run iperf3. After setting the log level to debug, I found it stuck at the heap allocator. Using the gdb_server, I was able to find that the program was stuck at the timer. More information can be found in the Screenshots section.

To Reproduce

Build with release mode.
Go to /benchmark/bin
Run ./iperf3 -s -B 10.0.2.15 -p 8080
The program will be stuck.

Expected behavior

Screenshots

Debug log when the program was stuck:

Call Stack:

Environment

CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Qemu: 8.2.1
OS: Ubuntu-22.04

Additional context

If changing the

    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ...

        self.heap
            .lock()
            .alloc(layout)
            .map_or(core::ptr::null_mut::<u8>(), |allocation| {
                allocation.as_ptr()
            })
    }

to

    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ...

        let res = self.heap
            .lock()
            .alloc(layout)
            .map_or(core::ptr::null_mut::<u8>(), |allocation| {
                allocation.as_ptr()
            });
        res
    }

then the race condition will be disappeared.

Logs

tatetian commented 1 month ago

I think lock should be replaced with lock_irq_disabled.

lrh2000 commented 1 month ago

I think lock should be replaced with lock_irq_disabled.

This should be irrelevant because the IRQs are disabled at the beginning of the alloc function.

https://github.com/asterinas/asterinas/blob/5aa28eae7e14594bbe68827114443b31002bf742/ostd/src/mm/heap_allocator.rs#L68-L69

tatetian commented 1 month ago

I guess that this reported issue is caused by a known limitation.

Our partner CertiK has found, as part of their effort to audit the memory subsystem of ostd, a chicken-egg problem between the heap allocator and page allocator: when the heap allocator has no available pages, it will call the rescue method, which in turns call page_allocator.alloc() to acquire more pages. But inside the page_allocator.alloc method, it may need to allocate new objects again. If this happens, the heap allocator will have to be locked again. But the heap allocator is already locked!

They have proposed two strategies to fix this:

Use a separate, dedicated allocator for the BTreeSet<> nodes within the frame allocator. This allocator should not depend on the main heap or frame allocator.
Implement a fixed-size, pre-allocated memory pool for the BTreeSet<> nodes. This approach ensures that node allocations do not trigger additional heap or frame allocations.

tatetian commented 1 month ago

@Tangzh33 Could you take the job of fixing this chicken-egg problem between the heap allocator and page allocator?

Tangzh33 commented 1 month ago

Thanks for bringing this up! I'll consider addressing aforementioned problems in the context of issue#1044.

asterinas / asterinas