YottaDB / YDB

Mirrored from https://gitlab.com/YottaDB/DB/YDB

YottaDB Darwin Port: passed sockets between jobbed processes don't work #292

Open shabiel opened 6 years ago

shabiel commented 6 years ago

For example, the M-Web-Server won't work.

It previously worked on the last port to Darwin, in V6.2-002A.

Confirmed on two different Macs.

I will debug as time permits.

nars1 commented 6 years ago

Not sure if it helps but maybe https://github.com/YottaDB/YottaDB/issues/275 is related.

shabiel commented 6 years ago

Interesting. I want to compile the latest source code for YottaDB on my Linux machine and see if we have the same issue.

shabiel commented 6 years ago

Shouldn't be a surprise, but I at least confirmed it's not an issue on Linux.

@nars1, any advice on debugging this? The multiple forks make it difficult. The way I debugged gtmshrsec on Cygwin was to put in sleeps and then run and attach to the process while it's sleeping.
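
A minimal sketch of the sleep-and-attach approach described above. The helper name `wait_for_debugger` and the flag `debugger_attached` are hypothetical and do not come from the YottaDB sources; the idea is only that a freshly forked process prints its PID and then polls so that there is time to attach.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical flag (not part of YottaDB): a debugger sets it to a
 * non-zero value to release the waiting process. */
volatile int debugger_attached = 0;

/* Print our PID, then poll once per second until the flag is set or
 * max_seconds have elapsed. Returns 1 if released by a debugger,
 * 0 if the wait timed out. */
int wait_for_debugger(unsigned int max_seconds)
{
    unsigned int waited;

    fprintf(stderr, "pid %ld waiting for debugger\n", (long)getpid());
    for (waited = 0; waited < max_seconds && !debugger_attached; waited++)
        sleep(1);
    return debugger_attached ? 1 : 0;
}
```

Calling something like `wait_for_debugger(30)` as the first statement in the child branch after the fork leaves a window to run `process attach -p <pid>` in lldb, then `expr debugger_attached = 1` and `continue`.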

shabiel commented 6 years ago

Found the crash.

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib          0x00007fff687cd4aa __kill + 10
1   libyottadb.dylib                0x0000000105b68714 gtm_dump_core + 1332 (gtm_dump_core.c:69)
2   libyottadb.dylib                0x0000000105b6d981 gtm_fork_n_core + 2241
3   libyottadb.dylib                0x0000000105ae9ebb ch_cond_core + 475
4   libyottadb.dylib                0x0000000105ea9d45 rts_error_va + 3333
5   libyottadb.dylib                0x0000000105eaa307 rts_error_csa + 359
6   libyottadb.dylib                0x0000000105e455d0 middle_child + 1168 (ojstartchild.c:187)
7   libyottadb.dylib                0x0000000105eaa0b9 rts_error_va + 4217 (rts_error.c:160)
8   libyottadb.dylib                0x0000000105eaa307 rts_error_csa + 359
9   libyottadb.dylib                0x0000000105e3fd88 ojstartchild + 19000 (ojstartchild.c:612)
10  libyottadb.dylib                0x0000000105e64c17 op_job + 4279 (op_job.c:190)
11  ???                             0x000000010b9575b0 0 + 4489311664

Crashes here:

             SEND(setup_fds[0], &params, SIZEOF(params), 0, rc);
             if (rc < 0)

Previous SENDs are apparently successful.

shabiel commented 6 years ago

Okay. After an hour of debugging, it turns out it's crashing at random sends, which means that the grandchild process is crashing at the get-go and the sends that succeed just succeed accidentally.
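
To illustrate why a send to a process that never reads can still succeed, here is a small standalone sketch. It is not YottaDB code: `probe_sends` and `struct send_result` are made-up names, and it assumes an `AF_UNIX` stream socket pair, which may differ from how `setup_fds` is actually created. The first send goes to a live child that never reads; the second goes out after the child has exited.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Results of two sends to a child that never reads its socket. */
struct send_result {
    int alive_rc;   /* send() return value while the child is running */
    int dead_rc;    /* send() return value after the child has exited */
    int dead_errno; /* errno reported by the second send() */
};

/* Hypothetical demonstration. Returns 0 on success, -1 on setup failure. */
int probe_sends(struct send_result *res)
{
    int fds[2], gate[2];
    char byte = 'x';
    pid_t pid;

    assert(res != NULL);
    signal(SIGPIPE, SIG_IGN); /* get an errno instead of a fatal SIGPIPE */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0 || pipe(gate) < 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: never reads fds[1]; blocks until the parent closes
         * the gate pipe, then exits. */
        close(fds[0]);
        close(gate[1]);
        if (read(gate[0], &byte, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(fds[1]);
    close(gate[0]);
    /* The kernel buffers the byte, so this succeeds although nobody reads. */
    res->alive_rc = (int)send(fds[0], &byte, 1, 0);
    close(gate[1]); /* let the child exit */
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    errno = 0;
    /* The peer is gone now, so this send is expected to fail. */
    res->dead_rc = (int)send(fds[0], &byte, 1, 0);
    res->dead_errno = errno;
    close(fds[0]);
    return 0;
}
```

In the real failure the crash and the sends race each other, so which send fails first would vary from run to run; the sketch only forces the two orderings deterministically.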

shabiel commented 6 years ago

I think I finally found the problem. I am single-stepping at the instruction level with si so that I can catch it at the right time.

(lldb) process attach -n mumps -w
Process 61571 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGTRAP
    frame #0: 0x00007fff686dc666 libsystem_c.dylib`fork + 18
->  0x7fff686dc666 <+18>: retq   
    0x7fff686dc667 <+19>: testl  %ebx, %ebx
    0x7fff686dc669 <+21>: je     0x7fff686dc67d            ; <+41>
    0x7fff686dc66b <+23>: cmpl   $-0x1, %ebx
Target 0: (mumps) stopped.

Executable module set to "/Users/sam/Documents/repos/YottaDB/build/./mumps".
Architecture set to: x86_64-apple-macosx.
(lldb) si
Process 61571 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff686dc666 libsystem_c.dylib`fork + 18
->  0x7fff686dc666 <+18>: retq   
    0x7fff686dc667 <+19>: testl  %ebx, %ebx
    0x7fff686dc669 <+21>: je     0x7fff686dc67d            ; <+41>
    0x7fff686dc66b <+23>: cmpl   $-0x1, %ebx
Target 0: (mumps) stopped.
Process 61571 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x000000010e43d0a0 libyottadb.dylib`intrpt_ok_state
->  0x10e43d0a0 <+0>: sbbb   %al, (%rax)
    0x10e43d0a2 <+2>: addb   %al, (%rax)

    0x10e43d0a4 <+0>: addl   %eax, (%rax)
    0x10e43d0a6 <+2>: addb   %al, (%rax)
Target 0: (mumps) stopped.
(lldb) si
Process 61571 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x10e43d0a0)
    frame #0: 0x000000010e43d0a0 libyottadb.dylib`intrpt_ok_state
->  0x10e43d0a0 <+0>: sbbb   %al, (%rax)
    0x10e43d0a2 <+2>: addb   %al, (%rax)

    0x10e43d0a4 <+0>: addl   %eax, (%rax)
    0x10e43d0a6 <+2>: addb   %al, (%rax)
Target 0: (mumps) stopped.

shabiel commented 6 years ago

More output from the same stack. I am actually puzzled by this. None of it makes sense.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x10e43d0a0)
    frame #0: 0x000000010e43d0a0 libyottadb.dylib`intrpt_ok_state
    frame #1: 0x000000010e43c1f0 libyottadb.dylib`xfer_name + 2336
  * frame #2: 0x000000010db8779e libyottadb.dylib`ojstartchild(jparms=0x00007ffee2adfd00, argcnt=1, non_exit_return=0x00007ffee2adfdbc, pipe_fds=0x00007ffee2adfe68) at ojstartchild.c:389
    frame #3: 0x000000010dbb0c17 libyottadb.dylib`op_job(argcnt=1) at op_job.c:190
    frame #4: 0x000000010ed361a2
(lldb) f 0
frame #0: 0x000000010e43d0a0 libyottadb.dylib`intrpt_ok_state
->  0x10e43d0a0 <+0>: sbbb   %al, (%rax)
    0x10e43d0a2 <+2>: addb   %al, (%rax)

    0x10e43d0a4 <+0>: addl   %eax, (%rax)
    0x10e43d0a6 <+2>: addb   %al, (%rax)
(lldb) p prev_intrpt_state
error: use of undeclared identifier 'prev_intrpt_state'
(lldb) f 1
frame #1: 0x000000010e43c1f0 libyottadb.dylib`xfer_name + 2336
    0x10e43c1f0 <+0>: addb   %dh, %al
    0x10e43c1f2 <+2>: pushq  %rdi
    0x10e43c1f3 <+3>: orl    $0x1, %eax
    0x10e43c1f8 <+8>: xorb   %dh, 0x12(%rbp)
(lldb) p prev_intrpt_state
error: use of undeclared identifier 'prev_intrpt_state'
(lldb) f 2
frame #2: 0x000000010db8779e libyottadb.dylib`ojstartchild(jparms=0x00007ffee2adfd00, argcnt=1, non_exit_return=0x00007ffee2adfdbc, pipe_fds=0x00007ffee2adfe68) at ojstartchild.c:389
   386          rts_error_csa(CSA_ARG(NULL) VARLSTCNT(6) ERR_YDBDISTUNVERIF, 4, STRLEN(ydb_dist), ydb_dist,
   387                  gtmImageNames[image_type].imageNameLen, gtmImageNames[image_type].imageName);
   388      FFLUSH(NULL);
-> 389      FORK_RETRY(child_pid);
   390      if (child_pid == 0)
   391      {
   392          /* DEBUG */
(lldb) p prev_intrpt_state
(intrpt_state_t) $7 = INTRPT_OK_TO_INTERRUPT

shabiel commented 6 years ago

One last thing, before I go to bed... I have had enough of this...

$rax is 0; $al is 0. So the error happens when dereferencing $rax.

nars1 commented 6 years ago

@shabiel: Regarding using gdb to debug these multi-process scenarios, the following settings are very useful. Setting each one to either of the two values listed in the items below lets gdb follow the child or the parent after a fork/exec, and controls whether the other process is suspended or detached (runs concurrently). Hope this helps.

  1. set follow-fork-mode child OR set follow-fork-mode parent
  2. set follow-exec-mode new OR set follow-exec-mode same
  3. set detach-on-fork off OR set detach-on-fork on
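
For trying out the settings above on something smaller than YottaDB, a double-fork target along these lines may be handy. It is a hypothetical sketch (`spawn_grandchild` is not a YottaDB function) that only mimics the parent, middle child, grandchild shape seen in the stack trace earlier in this thread; the grandchild reports its PID back through a pipe.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical test target: fork a middle child, which forks a
 * grandchild and exits. Returns the grandchild's PID as reported
 * over a pipe, or -1 on failure. */
pid_t spawn_grandchild(void)
{
    int report[2];
    pid_t middle, grandchild = -1;

    if (pipe(report) < 0)
        return -1;
    middle = fork();
    if (middle < 0)
        return -1;
    if (middle == 0) {
        /* Middle child: start the grandchild, then exit right away. */
        pid_t gc;

        close(report[0]);
        gc = fork();
        if (gc == 0) {
            /* Grandchild: report our own PID to the original parent. */
            pid_t self = getpid();

            if (write(report[1], &self, sizeof(self)) != (ssize_t)sizeof(self))
                _exit(1);
            _exit(0);
        }
        _exit(gc < 0 ? 1 : 0);
    }
    /* Parent: read the grandchild's PID, then reap the middle child. */
    close(report[1]);
    if (read(report[0], &grandchild, sizeof(grandchild)) != (ssize_t)sizeof(grandchild))
        grandchild = -1;
    close(report[0]);
    waitpid(middle, NULL, 0);
    return grandchild;
}
```

With `set follow-fork-mode child` in effect before `run`, gdb should follow the middle child at the first `fork()` and the grandchild at the second; adding `set detach-on-fork off` keeps the processes that are not being followed suspended so they can be inspected later.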