Closed. YangZhou1997 closed this issue 3 years ago
Hi, are you holding any lock while the thread is being rescheduled? Thread rescheduling can happen either explicitly (e.g., you invoke thread_yield()) or implicitly (e.g., you invoke a TCP read/write).
I just found that the error happens when the program calls exit(0): inside the manager's destructor, thread_yield() gets called.
Another quick question: you mentioned that a thread cannot hold a lock while invoking a TCP read/write? That sounds odd to me. My understanding is that a thread normally acquires a lock for a buffer, sends the buffer out through TCP, and finally unlocks. Can you elaborate a bit on how Shenango locks interact with TCP operations?
Hi, when you invoke something like a TCP read, you usually have to wait for incoming packets. The thread therefore yields its core so that the CPU can be granted to other threads with useful work to do. However, yielding the core with a lock held can easily cause a deadlock: the new thread may try to acquire the same lock and spin forever, since the lock is still held by the old thread, which is no longer running.
This is why we decided that acquiring a lock always disables preemption, and that rescheduling always checks that preemption is in the correct state (i.e., enabled).
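To make the rule above concrete, here is a minimal single-threaded toy model in C. It is only a sketch and not Shenango's actual code: every `toy_*` name, `bad_pattern`, and `good_pattern` are hypothetical, and the model merely counts held locks where the real runtime asserts on its own preemption counter inside `schedule`. It shows why a blocking call made with a lock held trips the check, and one possible restructuring that releases the lock first.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not Shenango's real API. */
static int toy_preempt_cnt = 0;      /* > 0 means preemption is disabled */

typedef struct { bool locked; } toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *l)
{
    toy_preempt_cnt++;               /* taking a lock disables preemption */
    assert(!l->locked);              /* a real spinlock would spin here */
    l->locked = true;
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
    assert(l->locked);
    l->locked = false;
    toy_preempt_cnt--;               /* releasing the lock re-enables it */
}

/* The check a scheduler could make before switching threads.
 * A real runtime would assert; the toy returns a flag instead. */
static bool toy_can_resched(void)
{
    return toy_preempt_cnt == 0;
}

/* Lock held across a point where a blocking TCP call would reschedule. */
static bool bad_pattern(toy_spinlock_t *l)
{
    toy_spin_lock(l);
    bool ok = toy_can_resched();     /* the blocking call would happen here */
    toy_spin_unlock(l);
    return ok;
}

/* Touch the shared state under the lock, release, then block. */
static bool good_pattern(toy_spinlock_t *l)
{
    toy_spin_lock(l);
    /* e.g. copy the shared buffer into a private one */
    toy_spin_unlock(l);
    return toy_can_resched();        /* the blocking call would happen here */
}
```

In this model `bad_pattern` reports that rescheduling is not allowed, while `good_pattern` reports that it is. If the buffer must stay protected during the send, a lock that is allowed to block (rather than a spin lock) may be the better fit, assuming the runtime provides one.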
That makes sense! Beyond the TCP locking design, I also found that a TCP write sometimes gets stuck if I use only a single kthread with two uthreads, even after applying the patch you mentioned here: https://github.com/AIFM-sys/AIFM/issues/3#issuecomment-766522041. The hang seems to disappear when I use two kthreads. Do you have any comments on that?
Yeah, that could be. I believe there is still a TCP window deadlock bug that shows up when you try to write a very large chunk of data. Keeping each write below the usual window size should work around the issue.
That's more of a bug that gets triggered when the TCP window is full. I'm not sure whether it has been fixed in the latest Caladan or Shenango repo, but you can give it a try there. If it still doesn't work, you can open an issue under the Caladan repo with your test code and configuration.
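The workaround of keeping each write below the window size could look roughly like the sketch below. It is an illustration under assumptions: `write_fn_t`, `write_in_chunks`, and the `fake_conn`/`fake_write` stand-in are hypothetical names, the writer is passed in as a callback rather than calling the runtime's TCP API directly, and the right chunk size depends on your configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical writer callback: returns bytes written, or a negative
 * value on error. In practice this would wrap the runtime's TCP write. */
typedef ssize_t (*write_fn_t)(void *conn, const void *buf, size_t len);

/* Send len bytes in pieces of at most chunk bytes each.
 * Returns the total bytes sent, or the writer's negative error code. */
static ssize_t write_in_chunks(write_fn_t wr, void *conn,
                               const void *buf, size_t len, size_t chunk)
{
    const char *p = buf;
    size_t sent = 0;

    assert(wr != NULL);
    if (chunk == 0)
        return -1;                   /* a zero chunk could never make progress */

    while (sent < len) {
        size_t n = (len - sent < chunk) ? (len - sent) : chunk;
        ssize_t ret = wr(conn, p + sent, n);
        if (ret < 0)
            return ret;              /* propagate the writer's error */
        if (ret == 0)
            break;                   /* no progress; stop instead of looping */
        sent += (size_t)ret;         /* short writes are handled naturally */
    }
    return (ssize_t)sent;
}

/* Stand-in connection for illustration: records what the writer was asked. */
struct fake_conn { size_t calls; size_t max_req; size_t total; };

static ssize_t fake_write(void *conn, const void *buf, size_t len)
{
    struct fake_conn *c = conn;
    (void)buf;
    c->calls++;
    if (len > c->max_req)
        c->max_req = len;
    c->total += len;
    return (ssize_t)len;
}
```

With the stand-in writer, sending 10000 bytes with a 4096-byte chunk results in three calls, none larger than 4096 bytes. This only sidesteps the symptom described above; it does not fix the underlying window handling.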
Sounds good! Thanks!
Hi, Zain
I understand this might not be the right place to ask, but I would really appreciate your help with the Shenango runtime. I am running into preempt assertion errors while modifying the AIFM source code to add custom functionality.
[413.181084] CPU 02| <0> FATAL: runtime/sched.c:558 ASSERTION '(preempt_cnt & ~(1 << 31)) != 1' FAILED IN 'schedule'
Basically, I added some spin locks and condition variables (cond_var), and I use the TCP stack. The error suggests to me that preempt_cnt is not in the expected state when threads get scheduled. But I believe I properly unlock every lock before exiting a function.
I wonder if you have any experience with how this could happen. Can I just comment out that assertion?
Best, Yang