recursive lock of a spin

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1.r12449, x86/nonvga_debug
2.build with -O2
3.

What is the expected output? What do you see instead?
loading start script:
> ifconfig lo 127.0.0.1 netmask 255.0.0.0 up 

  ______
 |  ____|                                            __          __
 | |___  _ __ ___            ____  ____  ____  _____/ /   _____ / /
 |  ___|| '_ ` _ \          / __ \/ __ \/ __ \/ ___/ /   |_____| |
 | |____| | | | | |_ _ _   / /_/ / /_/ / /_/ (__  )_/    |_____| |
 |______|_| |_| |_(_|_|_)  \____/\____/ .___/____(_)           | |
                                     /_/                        \_\
 ASSERTION FAILED on CPU 0
    at ./src/include/kernel/spinlock.h:92
    in function __spin_trylock

lock->owner != cpu_id

    (Recursive lock of a spin owned by this CPU)

Please use labels and text to provide additional information.

Original issue reported on code.google.com by ki.stfu on 19 Jan 2014 at 12:48

Attachments:

full-log.txt

GoogleCodeExporter commented 9 years ago

2339 loading start script:
2340 > ifconfig lo 127.0.0.1 netmask 255.0.0.0 up
2341 t->lock 0x00554008    -- &lock in thread_init
2342 wa->lock 0x00554918   -- &lock in waitq_init
2343 wa->lock 0x0055426c   -- &lock in waitq_init
2344 trylock 0x00554008    -- thread lock
2345 trylock 0x0024ab3c    -- runq lock
2346 trylock 0x0024ab3c    -- runq lock
2347 trylock 0x002360bc    -- start_script waitq lock
2348 trylock 0x00554918    -- waitq lock (see 2342)
2349 trylock 0x00554008    -- thread lock
2350 trylock 0x0024ab3c    -- runq lock
2351 trylock 0x0024ab3c    -- runq lock (usually it's the last trylock 
operation in this thread and after that the next start_script command starts)
2352 trylock 0xf000fe7e    -- ??? <- assert there
2353 trylock 0xf000fe7e    -- ??? (duplicate of 2352)
2354 trylock 0x00238f34    -- ???

Original comment by ki.stfu on 22 Jan 2014 at 10:51

Attachments:

GoogleCodeExporter commented 9 years ago

That looks weird. I'll try to investigate it soon hopefully.

Original comment by Eldar.Abusalimov on 23 Jan 2014 at 3:35

GoogleCodeExporter commented 9 years ago

I think this is due to thread_terminate/sched_freeze/task_exit/thread_exit. 
Some places seem wrong, for example:

__sched_freeze:
  t->active = false;
  t->waiting = false;

thread_exit:
  current->waiting = true

Original comment by ki.stfu on 23 Jan 2014 at 3:50

GoogleCodeExporter commented 9 years ago

> export PWD=/ 
> export HOME=/ 
root@embox:/#
root@embox:/#
root@embox:/#mount -t nfs 10.0.2.10:/home/hak1r/data /tmp
10.0.2.10: RPC: Remote system error - Connection timed out
tish: mount: Command returned with code -1: Unknown error code. PLEASE REPORT 
ISSUE 519!
root@embox:/#
root@embox:/#
root@embox:/#mount -t nfs 10.0.2.10:/home/hak1r/data /tmp
root@embox:/#umount  /tmp
root@embox:/#mount -t nfs 10.0.2.10:/home/hak1r/data /tmp
root@embox:/#umount  /tmp
root@embox:/#mount -t nfs 10.0.2.10:/home/hak1r/data /tmp
root@embox:/#umount  /tmp

  ______
 |  ____|                                            __          __
 | |___  _ __ ___            ____  ____  ____  _____/ /   _____ / /
 |  ___|| '_ ` _ \          / __ \/ __ \/ __ \/ ___/ /   |_____| |
 | |____| | | | | |_ _ _   / /_/ / /_/ / /_/ (__  )_/    |_____| |
 |______|_| |_| |_(_|_|_)  \____/\____/ .___/____(_)           | |
                                     /_/                        \_\
 ASSERTION FAILED on CPU 0
    at ./src/include/kernel/spinlock.h:92
    in function __spin_trylock

lock->owner != cpu_id

    (Recursive lock of a spin owned by this CPU)

 --   01000000 * A R    thread 1  task 0 --------------------------------------

run 0x0010b924 <boot_stub+0x0>               src/kernel/thread/boot_thread.c:20

 --   01000000     R    thread 0  task 0 --------------------------------------

  1 0x0013f6df <__schedule+0x14b>                  src/kernel/sched/sched.c:379

run 0x0010b8a8 <idle_run+0x0>                src/kernel/thread/idle_thread.c:21

Original comment by ki.stfu on 31 Jan 2014 at 1:21

GoogleCodeExporter commented 9 years ago

Is there a backtrace available? At least where is the spin_lock() called from?

Original comment by Eldar.Abusalimov on 31 Jan 2014 at 1:26

GoogleCodeExporter commented 9 years ago

 32 void waitq_add(struct waitq *wq, struct waitq_link *wql) {
 33     ipl_t ipl;
 34 
 35     assert(wq && wql);
 36 
 37     ipl = spin_lock_ipl(&wq->lock); <-- there
 38     __waitq_add(wq, wql);
 39     spin_unlock_ipl(&wq->lock, ipl);
 40 }
 41

Original comment by ki.stfu on 1 Feb 2014 at 3:28

Attachments:

GoogleCodeExporter commented 9 years ago

 17 int task_waitpid(pid_t pid) {
 18     struct task *task, *parent;
 19     struct waitq_link wql;
 20     int ret = 0;
 21 
 22     waitq_link_init(&wql);
 23 
 24     sched_lock();
 25     {
 26         task = task_table_get(pid);
 28         parent = task->parent;
 29         if (!task) {
 30             ret = -ECHILD;
 31             goto out;
 32         }
 33 
 34         waitq_wait_prepare(task->waitq, &wql);<--- there
 35 
 36         sched_wait();
 37 
 38         ret = parent->child_err;
 39 
 40         /* no cleanup since task is dead already */
 41     }
 42 out:
 43     sched_unlock();
 44     
 45     return ret;
 46 }
 47

Original comment by ki.stfu on 1 Feb 2014 at 3:43

Attachments:

GoogleCodeExporter commented 9 years ago

Here occurs something strange. After checking task for NULL it doesn't jump to 
out..

 17 int task_waitpid(pid_t pid) {
 18     struct task *task, *parent;
 19     struct waitq_link wql;
 20     int ret = 0;
 21 
 22     waitq_link_init(&wql);
 23 
 24     sched_lock();
 25     {
 26         task = task_table_get(pid);
 27         printk("waitpid %p %u\n", task, pid);
 28         parent = task->parent;
 29         if (!task) {
 30             ret = -ECHILD;
 31             goto out;
 32         }
 33 
 34         printk("waitpid prepare wq %p\n", task->waitq);
 35         waitq_wait_prepare(task->waitq, &wql,423);
 36         
 37         sched_wait();
 38         
 39         ret = parent->child_err;
 40         
 41         /* no cleanup since task is dead already */
 42     }
 43 out:
 44     sched_unlock();
 45     
 46     return ret;
 47 }

> route add 127.0.0.0 netmask 255.0.0.0 lo 
t->lock 0x00550008
wa->lock 0x00550918
wa->lock 0x0055026c
trylock 0x00550008
trylock 0x0024ab3c
trylock 0x0024ab3c
trylock 0x002360bc
trylock 0x00550918
trylock 0x00550008
trylock 0x0024ab3c
trylock 0x0024ab3c
waitpid 0x00000000 1
waitpid prepare wq 0xf000fe6e <--- jump was before it
trylock 0xf000fe7e
place 423
QEMU: Terminated

Original comment by ki.stfu on 1 Feb 2014 at 3:45

Attachments:

GoogleCodeExporter commented 9 years ago

It's gcc optimization. After line 28 gcc assume that task not NULL, so the next 
IF statement is not checked and is considered as FALSE.

Original comment by ki.stfu on 1 Feb 2014 at 4:07

GoogleCodeExporter commented 9 years ago

man gcc:
       -fdelete-null-pointer-checks
           Assume that programs cannot safely dereference null
           pointers, and that no code or data element resides there.
           This enables simple constant folding optimizations at all
           optimization levels.  In addition, other optimization
           passes in GCC use this flag to control global dataflow
           analyses that eliminate useless checks for null pointers;
           these assume that if a pointer is checked after it has
           already been dereferenced, it cannot be null.

Original comment by ki.stfu on 1 Feb 2014 at 4:16

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

fixed by r12543

Original comment by ki.stfu on 1 Feb 2014 at 4:20

Changed state: Fixed

SmallLonlyWolf / embox

recursive lock of a spin #675