Deleting a thread causes Illegal page fault

Update: more details in a follow-up comment below. This is really trivial to reproduce.

This is a bit vague, I am not fully sure of what happens here. But: at certain points, if certain errors occur, we can run into a page fault that looks like this:

I have a clear way to reproduce it and some ideas about why this happens so here goes:

How to reproduce

diff --git a/storm/generic/mailbox.c b/storm/generic/mailbox.c
index 02ea46a..883cb85 100644
--- a/storm/generic/mailbox.c
+++ b/storm/generic/mailbox.c
@@ -427,10 +427,10 @@ return_type mailbox_receive(mailbox_id_type mailbox_id, message_parameter_type *
     }

     // Receive the message.
-    next_message_size = circular_queue_peek(mailbox->queue);
+    //next_message_size = circular_queue_peek(mailbox->queue);

-    assert(next_message_size != -1,
-        "circular_queue_peek returned -1 even though a message should be available");
+    // assert(next_message_size != -1,
+    //     "circular_queue_peek returned -1 even though a message should be available");

     unsigned int next_message_data_size = next_message_size - sizeof(message_type);

Apply the diff above and you should get this error right on bootup.

gdb excerpt

Here is a gdb excerpt from when this happens:

(gdb) p/x *current_tss
$2 = {previous_task_link = 0x0, u0 = 0x0, esp0 = 0xfc001000, ss0 = 0x10, u1 = 0x0, esp1 = 0x0, ss1 = 0x0, u2 = 0x0, 
  esp2 = 0x0, ss2 = 0x0, u3 = 0x0, cr3 = 0x12c000, eip = 0x40005212, eflags = 0x216, eax = 0xc0cac01a, ecx = 0xc0cac01a, 
  edx = 0xc0cac01a, ebx = 0xc0cac01a, esp = 0x0, ebp = 0x0, esi = 0x0, edi = 0x0, es = 0x23, u4 = 0x0, cs = 0x1b, 
  u5 = 0x0, ss = 0x23, u6 = 0x0, ds = 0x23, u7 = 0x0, fs = 0x23, uint8_t = 0x0, gs = 0x23, u9 = 0x0, ldt_selector = 0x0, 
  u10 = 0x0, t = 0x0, u11 = 0x0, iomap_base = 0x160, process_type = 0x1, process_id = 0x2, cluster_id = 0x0, 
  thread_id = 0x2, parent_tss = 0xa000, user_id = 0x0, priority_process = 0x0, priority_cluster = 0x0, 
  priority_thread = 0x0, stack_pages = 0x1, allocated_pages = 0x9, mutex_kernel = 0x0, mutex_user_id = 0x0, 
  mutex_time = 0x0, mailbox_id = 0x0, state = 0x0, timeslices = 0x1, thread_name = {0x49, 0x6e, 0x69, 0x74, 0x69, 0x61, 
    0x6c, 0x69, 0x73, 0x69, 0x6e, 0x67, 0x0 <repeats 116 times>}, code_base = 0x132, data_base = 0x136, 
  code_pages = 0x4, data_pages = 0x2, virtual_code_base = 0x40002000, virtual_data_base = 0x40006000, iomap_size = 0xe, 
  capability = {modify_services = 0x0, modify_hardware = 0x0, thread_control_others = 0x0, kill_other_threads = 0x0}, 
  initialised = 0x1, instruction_pointer = 0x0, process_info = 0xe899138, iomap = 0xe8afb78}
(gdb) next
47      return thread_control(thread_id, class, parameter);
(gdb) step
thread_control (thread_id=2, class=0, parameter=0) at thread.c:389
389 {
(gdb) next
394     mutex_kernel_wait(&tss_tree_mutex);
(gdb) 
395     tss = thread_get_tss(thread_id);
(gdb) 
396     if (tss == NULL)
(gdb) 
403     switch (class)
(gdb) 
409             if (thread_id == THREAD_ID_KERNEL)
(gdb) 
414             thread_delete(tss);
(gdb) 
417             if (tss->thread_id == current_tss->thread_id)
(gdb) 
419                 mutex_kernel_signal(&tss_tree_mutex);
(gdb) 
420                 dispatch_next();
(gdb) 

Breakpoint 2, system_call_thread_control (thread_id=3, class=0, parameter=0) at system_call.c:46
46  {
(gdb) 
47      return thread_control(thread_id, class, parameter);
(gdb) step
thread_control (thread_id=3, class=0, parameter=0) at thread.c:389
389 {
(gdb) next
394     mutex_kernel_wait(&tss_tree_mutex);
(gdb) 
395     tss = thread_get_tss(thread_id);
(gdb) 
396     if (tss == NULL)
(gdb) 
403     switch (class)
(gdb) 
409             if (thread_id == THREAD_ID_KERNEL)
(gdb) 
414             thread_delete(tss);
(gdb) 
417             if (tss->thread_id == current_tss->thread_id)
(gdb) 
419                 mutex_kernel_signal(&tss_tree_mutex);
(gdb) 
420                 dispatch_next();
(gdb)

I have a feeling that the tss_list gets corrupted at some point (perhaps because of incorrect mutexing). Look at the next pointers in this trace:

(gdb) print tss_list
$1 = (tss_list_type *) 0xe8ac3c8
(gdb) print *tss_list
$2 = {previous = 0x0, next = 0xe8ac398, tss = 0xe8b2818, thread_id = 18}
(gdb) print tss_list->next
$3 = (struct tss_list_type *) 0xe8ac398
(gdb) print *tss_list->next
$4 = <incomplete type>
(gdb) print tss_list->next
$5 = (struct tss_list_type *) 0xe8ac398
(gdb) print (tss_list_type)tss_list->next
$6 = {previous = 0xe8ac398, next = 0xe8b2818, tss = 0x12, thread_id = 243974328}
(gdb) print (tss_list_type)tss_list->next->next
There is no member named next.
(gdb) print ((tss_list_type)tss_list->next)->next
$7 = (struct tss_list_type *) 0xe8b2818
(gdb) print (((tss_list_type)tss_list->next)->next
A syntax error in expression, near `'.
(gdb) print *((tss_list_type)tss_list->next)->next
$8 = <incomplete type>
(gdb) print (tss_list_type)((tss_list_type)tss_list->next)->next
$9 = {previous = 0xe8b2818, next = 0x12, tss = 0xe8ac0b8, thread_id = 243975144}

The next step would be to use the watch tss_list approach as suggested in #21 to see how the TSS list gets modified. Maybe that would help us identify what the source of the corrupted memory is.

chaos4ever / chaos

Deleting a thread causes Illegal page fault #155

How to reproduce

gdb excerpt