AIFM-sys / AIFM

AIFM: High-Performance, Application-Integrated Far Memory
MIT License

sched.c preempt assertion error #4

Closed YangZhou1997 closed 3 years ago

YangZhou1997 commented 3 years ago

Hi, Zain

I understand this might not be the right place to ask, but I would really appreciate your help with the Shenango runtime. I am hitting preempt assertion errors while modifying the AIFM source code to add custom functionality.

[413.181084] CPU 02| <0> FATAL: runtime/sched.c:558 ASSERTION '(preempt_cnt & ~(1 << 31)) != 1' FAILED IN 'schedule'

Basically, I added some spin locks and condition variables (cond_var), and I use the TCP stack. It looks to me like preempt_cnt is not set correctly when threads get scheduled. But I believe I properly release every lock before exiting a function.

I wonder if you have any experience with how this could happen. Can I just comment out that assertion?

Best, Yang

zainryan commented 3 years ago

Hi, are you holding any lock when the thread gets rescheduled? Thread rescheduling can happen either explicitly (e.g., you invoke thread_yield()) or implicitly (e.g., you invoke TCP read/write).

YangZhou1997 commented 3 years ago

I just found that the error happens when the program calls exit(0): thread_yield() gets called inside the manager's destructor.

Another quick question: you mention that a thread cannot hold a lock while invoking TCP read/write? That sounds odd to me. My understanding is that a thread normally acquires a lock for a buffer, sends the buffer out through TCP, and finally unlocks. Can you elaborate a bit on how Shenango locks and TCP operations interact?

zainryan commented 3 years ago

Hi, when you invoke something like TCP read, you usually have to wait for incoming packets. The calling thread therefore yields its core so that the CPU can be granted to other threads that have useful work to do. However, yielding the core with a spin lock held can cause a deadlock: the newly scheduled thread may try to acquire the same lock and spin forever, since the lock is still held by the old thread.

This is why we decided that acquiring the lock always disables preemption, and that on every reschedule we check that preemption is set correctly (i.e., enabled).
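To make the rule above concrete, here is a small standalone model of it. It is only a sketch, not Shenango code: every `model_*` and `send_*` name is hypothetical, and the check in `model_may_reschedule` is my reading of the assertion in the log (the counter, with bit 31 masked out, must be exactly 1 when the scheduler runs). It also shows the usual way around the problem from the question: copy the shared buffer under the lock, release the lock, and only then make the call that may reschedule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Flag bit that the logged assertion masks out (1 << 31). */
#define MODEL_NOT_PENDING (1u << 31)

/* Hypothetical stand-in for the runtime's per-thread preempt counter. */
static uint32_t model_preempt_cnt = MODEL_NOT_PENDING;

/* Taking a lock disables preemption: the counter goes up by one. */
void model_lock(void) { model_preempt_cnt++; }

/* Releasing the lock re-enables preemption: the counter goes back down. */
void model_unlock(void)
{
    assert((model_preempt_cnt & ~MODEL_NOT_PENDING) > 0);
    model_preempt_cnt--;
}

/*
 * Stand-in for any call that may reschedule (an explicit yield or a
 * blocking TCP read/write). Assumption: the scheduler entry disables
 * preemption once itself and then expects the count to be exactly 1.
 * Returns false where the real runtime would hit the assertion.
 */
bool model_may_reschedule(void)
{
    model_preempt_cnt++;
    bool ok = (model_preempt_cnt & ~MODEL_NOT_PENDING) == 1;
    model_preempt_cnt--;
    return ok;
}

/* Problematic pattern: the blocking call happens while the lock is held. */
bool send_while_locked(const char *shared, size_t len)
{
    (void)shared;
    (void)len;
    model_lock();
    bool ok = model_may_reschedule(); /* count is 2 here, so the check fails */
    model_unlock();
    return ok;
}

/* Safer pattern: snapshot under the lock, unlock, then do the blocking call. */
bool send_after_unlock(const char *shared, char *scratch, size_t len)
{
    model_lock();
    memcpy(scratch, shared, len); /* short critical section, no yield inside */
    model_unlock();
    return model_may_reschedule(); /* count is 1 here, so the check passes */
}
```

In this model `send_while_locked` reports a failed check and `send_after_unlock` passes, at the cost of one extra copy of the buffer.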

YangZhou1997 commented 3 years ago

That makes sense! Beyond the TCP locking design, I also find that a TCP write sometimes gets stuck if you use only a single kthread with two uthreads, even after I apply the patch you mention here: https://github.com/AIFM-sys/AIFM/issues/3#issuecomment-766522041. The hang seems to disappear when I use two kthreads. Not sure if you have any comments on that.

zainryan commented 3 years ago

Yeah, that might be the case. I believe there is still a TCP window deadlock bug that shows up when you try to write a very large chunk of data. Keeping the data size below the usual window size should work around the issue.
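One way to apply that workaround is to split a large buffer into several smaller writes. The sketch below is an illustration under assumptions, not code from AIFM or Shenango: `write_fn`, `write_chunked`, and the `count_sink` test writer are hypothetical names, the real connection's write call would have to be wrapped to match `write_fn`, and `max_chunk` is left as a parameter because the thread does not state the actual window size. Whether chunking alone avoids the bug has not been verified.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical writer signature: returns bytes written, or < 0 on error. */
typedef ssize_t (*write_fn)(void *conn, const void *buf, size_t len);

/*
 * Write len bytes from buf in pieces of at most max_chunk bytes.
 * Returns the number of bytes written, or a negative value if the
 * underlying writer reports an error.
 */
ssize_t write_chunked(write_fn wr, void *conn, const void *buf,
                      size_t len, size_t max_chunk)
{
    const char *p = buf;
    size_t done = 0;

    assert(wr != NULL && max_chunk > 0);
    while (done < len) {
        size_t n = len - done;
        if (n > max_chunk)
            n = max_chunk;
        ssize_t ret = wr(conn, p + done, n);
        if (ret < 0)
            return ret;          /* propagate the writer's error */
        if (ret == 0)
            break;               /* no progress: stop instead of looping */
        done += (size_t)ret;     /* short writes are retried from here */
    }
    return (ssize_t)done;
}

/* Test-only writer that records how it was called instead of sending. */
struct count_sink {
    size_t calls;
    size_t bytes;
    size_t largest;
};

ssize_t count_sink_write(void *conn, const void *buf, size_t len)
{
    struct count_sink *s = conn;
    (void)buf;
    s->calls++;
    s->bytes += len;
    if (len > s->largest)
        s->largest = len;
    return (ssize_t)len;
}
```

With `count_sink_write` as the writer, a 150000-byte buffer and a 65536-byte limit result in three calls, none larger than the limit.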

YangZhou1997 commented 3 years ago

I see. Thanks! Regarding the TCP stack, would Caladan or the original Shenango provide better performance or stability? I ask because I find that once the object being sent exceeds 64KB, the TCP write latency jumps from 25.6us (at 16KB) to 11969.8us.

zainryan commented 3 years ago

That looks more like a bug that gets triggered when the TCP window is full. I'm not sure whether it has been fixed in the latest Caladan or Shenango repo, but you can give it a try there. If it still doesn't work well, you can open an issue under the Caladan repo with your test code and configuration.

YangZhou1997 commented 3 years ago

Sounds good! Thanks!