Closed MTres19 closed 2 years ago
@MTres19 Can you point to the semaphore that it is hanging on?
Also slightly unrelated to the issue at hand, but it seems like you might be wanting to use the low/highpriority worker thread mechanism that most of the other drivers use for scheduling rx/tx work.
Hi @btashton yep, I think it's the semaphore for the list of open file descriptors in the task control that's being initialized. struct tcb_s.group->tg_filelist.fl_sem
. But I'm not familiar with the code so it's hard to see the big picture... the relevant line in my driver is here if you're curious. I assume this problem hasn't come up before because most drivers wouldn't create a kernel thread in a function that's called from the application.
This "fixes" the problem for me but it's just a band-aid I know.
diff --git a/sched/task/task_init.c b/sched/task/task_init.c
index b187b74ebe..3e3fd61a45 100644
--- a/sched/task/task_init.c
+++ b/sched/task/task_init.c
@@ -101,11 +101,14 @@ int nxtask_init(FAR struct task_tcb_s *tcb, const char *name, int priority,
}
/* Associate file descriptors with the new task */
-
- ret = group_setuptaskfiles(tcb);
- if (ret < 0)
+
+ if (!(ttype & TCB_FLAG_TTYPE_KERNEL))
{
- goto errout_with_group;
+ ret = group_setuptaskfiles(tcb);
+ if (ret < 0)
+ {
+ goto errout_with_group;
+ }
}
if (stack)
Based on this comment.
Regarding a dedicated thread vs the work queue, I actually do use the low-priority work queue to check periodically for bus errors when servicing the interrupt would be excessive. I read the admonition against using the work queue for things like this on the wiki.
The issue with using a worker thread for what you have is that you have a while loop that waits on a semaphore https://github.com/MTres19/incubator-nuttx/blob/b63c54bb52f4158c6495723359032668f0947080/arch/arm/src/tiva/common/tiva_can.c#L731
This is a great place to use a a worker thread. Right here you currently are signalling your kernel thread to do work, instead you can schedule work and the work function itself will not wait and deadlock the the work queue. https://github.com/MTres19/incubator-nuttx/blob/b63c54bb52f4158c6495723359032668f0947080/arch/arm/src/tiva/common/tiva_can.c#L1791
work_queue(HPWORK, &priv->cbwork, (worker_t)priv->callback,
priv->cbarg, 0);
Note that the arg is stored in the device structure. This is one issue with what you are doing right now with your kernel thread, you are passing arguments that are allocated on the stack so they do not have have the proper lifecycle for the thread.
If you really want to keep using the kernel thread for performance reasons, You should be able to just move that initialization to the driver bring up and it will just be stuck waiting on the semaphore.
Thanks for the input @btashton. I'm still not sure I like the idea of shoving things onto the work queue though, especially since the Tiva ADC driver already uses it. For example, in my application I want to short a sensor to ground, measure a different sensor to ensure the two sensors aren't shorted together, and then release the first sensor before the external analog plausibility logic detects it. If a bunch of CAN messages happened to come in and block the work queue in the middle of that, I imagine there could be problems.
But regarding the kernel thread... isn't argv copied to the new thread's TCB? It seems like it should be safe to use stack memory to pass the arguments in that case. I agree I could move the kernel thread without much trouble but I think there is still a bug here nonetheless.
Tiva ADC driver already uses it. For example, in my application I want to short a sensor to ground, measure a different sensor to ensure the two sensors aren't shorted together, and then release the first sensor before the external analog plausibility logic detects it. If a bunch of CAN messages happened to come in and block the work queue in the middle of that, I imagine there could be problems.
I don't see how this will be an issue, your driver should only have work scheduled once on the work queue. The logic you have in that thread is very small, small enough that most of the other CAN drivers just handle calling can_receive
right in the interrupt handler. Where it is bad is if you have a worker callback that waits on semaphore or polls on a ready/busy register. I did not see either of these cases here. So if you get a bunch of ADC work CAN work come in at the same time they should be equally scheduled and you may actually pay less in context switching.
But regarding the kernel thread... isn't argv copied to the new thread's TCB? It seems like it should be safe to use stack memory to pass the arguments in that case. I agree I could move the kernel thread without much trouble but I think there is still a bug here nonetheless.
You are right that is handled by nxtask_setup_stackargs
I was thinking about something else.
I am not disagreeing yet that there might be a bug, just pointing out how what you are doing is different than how most other drivers are written and that I don't really see the benefit (I see a drawback in resource usage).
I just realized there might already be work ongoing on this in #2661.
That makes sense. I hadn't thought much about the context switch overhead. I did originally just call can_receive in the ISR but I realized that kinda defeated the purpose of a hardware RX FIFO. I might switch it to use a work queue then.
@MTres19 when do your kthread get created? Is it before idle task setup? https://github.com/apache/incubator-nuttx/blob/master/sched/init/nx_start.c#L767
I just realized there might already be work ongoing on this in #2661.
Yes, this patch may help you.
It's in tivacan_setup, which gets called when the char device is opened, so it tries to inherit the app's file descriptors.
On Sun, Jan 10, 2021, 3:27 AM Xiang Xiao notifications@github.com wrote:
@MTres19 https://github.com/MTres19 when do your kthread get created? Is it before idle task setup?
https://github.com/apache/incubator-nuttx/blob/master/sched/init/nx_start.c#L767
I just realized there might already be work ongoing on this in #2661 https://github.com/apache/incubator-nuttx/pull/2661.
Yes, this patch may help you.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/incubator-nuttx/issues/2663#issuecomment-757438035, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEKWLADPC7OHAS47FMONSETSZFQGDANCNFSM4V3UVI6Q .
so it happen very late, then it's strange that file_dup2 fail to acquire the semphare.
@xiaoxiang781216 I just noticed something similar when trying to create a kernel thread off of a call in sim_bringup()
(I'm add usbhost support via libusb). I get a segfault because in sched_dupfiles
rtcb->group
is 0 which means parent and child are both invalid
Thread 1 "nuttx" received signal SIGSEGV, Segmentation fault.
sched_dupfiles (tcb=0x7ffff3e295d0) at group/group_setuptaskfiles.c:103
103 if (parent[i].f_inode &&
Missing separate debuginfos, use: dnf debuginfo-install libgcc-10.2.1-9.fc33.x86_64 libusbx-1.0.23-2.fc33.x86_64 systemd-libs-246.7-2.fc33.x86_64 zlib-1.2.11-23.fc33.x86_64
(gdb) p parent
$1 = (struct file *) 0x78
(gdb) list
98 /* Check if this file is opened by the parent. We can tell if
99 * if the file is open because it contain a reference to a non-NULL
100 * i-node structure.
101 */
102
103 if (parent[i].f_inode &&
104 (parent[i].f_oflags & O_CLOEXEC) == 0)
105 {
106 /* Yes... duplicate it for the child */
107
(gdb) p rtcb
$2 = (struct tcb_s *) 0x7ffff3e27c90
(gdb) p rtcb->group
$3 = (struct task_group_s *) 0x0
(gdb) p rtcb*
A syntax error in expression, near `'.
(gdb) p *rtcb
$4 = {flink = 0x43f8a0 <g_idletcb>, blink = 0x7ffff7f38020 <_IO_strn_jumps>, group = 0x0, pid = 1, start = 0x402184 <nxtask_start>, entry = {pthread = 0x408e78 <nsh_main>, main = 0x408e78 <nsh_main>}, sched_priority = 100 'd',
init_priority = 100 'd', task_state = 3 '\003', flags = 0, lockcount = 0, errcode = 0, waitdog = {next = 0x0, func = 0x85b82ce26aee8900, lag = 0, flags = 0 '\000', arg = 9635500753727424768}, adj_stack_size = 140737285094944,
stack_alloc_ptr = 0x85b82ce26aee8900, adj_stack_ptr = 0x4412a0, waitsem = 0x0, sigprocmask = 4091706928, sigwaitmask = 32767, sigpendactionq = {head = 0x4412a0, tail = 0xffffffff}, sigpostedq = {head = 0x3, tail = 0x4441d0}, sigunbinfo = {
si_signo = 5 '\005', si_code = 23 '\027', si_errno = 232 '\350', si_value = {sival_int = 48, sival_ptr = 0x3000000030}, si_pid = 32304, si_status = 32767}, msgwaitq = 0x7ffff3e27d60, mhead = 0x85b82ce26aee8900, xcp = {
sigdeliver = 0x7ffff3e27e1c, regs = {4539616, 0, 0, 0, 140737353413158, 4473296, 140737351999557, 140737351996080}}, name = "x\030\371\367\377\177\000\000\000\000\000\000\000\000\000\000 \203\363\367\t\000\000\000\340DE\000\000\000\000"}
(gdb) p rtcb->group
$5 = (struct task_group_s *) 0x0
(gdb) bt
#0 sched_dupfiles (tcb=0x7ffff3e295d0) at group/group_setuptaskfiles.c:103
#1 0x00000000004028c8 in group_setuptaskfiles (tcb=0x7ffff3e295d0) at group/group_setuptaskfiles.c:219
#2 0x0000000000401ae8 in nxtask_init (tcb=0x7ffff3e295d0, name=0x4335bf "usbhost", priority=50, stack=0x0, stack_size=65536, entry=0x42e28a <usbhost_waiter>, argv=0x0) at task/task_init.c:105
#3 0x000000000040196f in nxthread_create (name=0x4335bf "usbhost", ttype=2 '\002', priority=50, stack_size=65536, entry=0x42e28a <usbhost_waiter>, argv=0x0) at task/task_create.c:94
#4 0x0000000000401a83 in kthread_create (name=0x4335bf "usbhost", priority=50, stack_size=65536, entry=0x42e28a <usbhost_waiter>, argv=0x0) at task/task_create.c:235
#5 0x000000000042e3a8 in up_init_usbhost (bus=0) at sim_bringup.c:129
#6 0x000000000042e4e8 in sim_bringup () at sim_bringup.c:491
#7 0x000000000042e283 in board_app_initialize (arg=0) at sim_appinit.c:82
#8 0x000000000042a78f in boardctl (cmd=65281, arg=0) at boardctl.c:326
#9 0x0000000000408f33 in nsh_initialize () at nsh_init.c:103
#10 0x0000000000408ec4 in nsh_main (argc=1, argv=0x7ffff3e29340) at nsh_main.c:143
#11 0x000000000040591d in nxtask_startup (entrypt=0x408e78 <nsh_main>, argc=1, argv=0x7ffff3e29340) at sched/task_startup.c:165
#12 0x0000000000402232 in nxtask_start () at task/task_start.c:144
#13 0x0000000000000000 in ?? ()
@xiaoxiang781216 I just noticed something similar when trying to create a kernel thread off of a call in
sim_bringup()
(I'm add usbhost support via libusb). I get a segfault because insched_dupfiles
rtcb->group
is 0 which means parent and child are both invalid
This is probably running on the IDLE thread which has no parent and the PID for the IDLE thread always zero. Zero does not mean invalid anywhere except in the case of the parent PID here.
The meaning PID of zero is overloaded elsewhere too: In several POSIX APIs, PID == 0 means "this task." But that is not an issue here.
Just try the same code in my environment, rtcb->group point to the good value: I suspect that you may hit some memory corruption, could you just try to create a kernel thread without linking with libusb?
Hmm, you might be right, I could not find anything else that made sense when I was looking last night. The library does get initialized in the parent thread which could use a fair bit of the stack. The memory around group looked good which is why I did not suspect this, but I'll mess with it again tonight.
So taking a fresh look at this and not being quite so pressed for time, I would like to discuss what the correct fix for this bug would be. Restated, the problem is:
App calls open()
→ can_open()
→ tivacan_open()
, which calls kthread_create()
. kthread_create
calls nxthread_create()
which allocates the thread control block, sets the thread type to kernel and calls nxtask_init()
. nxtask_init()
does the following:
group_allocate()
. group_allocate
allocates memory on the heap but it also copies the UID, GID, and environment of the current task to the new task. That doesn't make sense for kernel threads.group_setuptaskfiles()
. Again this tries to copy userspace file descriptors to the kernel thread which is wrong.group_initialize()
to finish the setup.To me it seems kernel threads should have a different task group struct that excludes at least tg_filelist
, tg_streamlist
, tg_envsize
, and tg_envp
. Or at least these should not be initialized for kernel threads. The latter option would be easy, like the band-aid patch I already posted.
It also seems that kernel threads should have their UID and GID set to 0 instead of inheriting it.
Are these accurate conclusions? Given a recommendation I could submit a PR.
To me it seems kernel threads should have a different task group struct that excludes at least tg_filelist, tg_streamlist, tg_envsize, and tg_envp. Or at least these should not be initialized for kernel threads. The latter option would be easy, like the band-aid patch I already posted You are right. There is no purpose in having most of those things in a kernel thread since they cannot have pthreads and, hence, can't really have a task group. See Why Can't Kernel Threads Have pthreads? - NUTTX - Apache Software Foundation.
Why Can't Kernel Threads Have pthreads? - NUTTX - Apache Software Founda...
|
|
|
There are other complications like: What if a kernel thread creates a user task? The inheritance of file descriptors, stdio I/O and the environment will get a lot more complex.
The current design does save a lot of code duplication by re-using all of the user-space task structures.
It also seems that kernel threads should have their UID and GID set to 0 instead of inheriting it. Yes, but there is not fully developed user-based security in the OS so it would end up being mostly cosmetic.
— Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you commented.Message ID: @.***>
How about skip some normal setup(e.g. fd dup) from kernel thread creation?
How about skip some normal setup(e.g. fd dup) from kernel thread creation?
Well, it fixes the problem but like I said it's just a bandaid. A shortcut, rather than addressing the real problem. IMO better to leave this and #1108 open and I'll carry the patch downstream. Then we could work on a proper fix with all the necessary testing.
More people hit the similar issue, let's fix it by skip the dup now(https://github.com/apache/incubator-nuttx/pull/5379) until we get the better solution.
More people hit the similar issue, let's fix it by skip the dup now(#5379) until we get the better solution
If we merge https://github.com/apache/incubator-nuttx/pull/5379 then should we create a new issue to "get a better solution"?
We can track the better solution by https://github.com/apache/incubator-nuttx/issues/1108?
In developing my CAN driver for the Tiva chips, I decided to use a kernel thread to load messages from the RX mailboxes. I call kthread_create like this:
where kthd_argv is a string vector with a terminating null entry as specified in the comments for kthread_create. This is done in
tivacan_setup()
, which is called bycan_open()
. However,kthread_create()
hangs infile_dup2()
; for some reason it copies 3 file descriptors (stdin, stdout, stderr, I suppose) then hangs waiting for a semaphore. This should not happen—kthreads should not even have memory allocated for open file descriptors.