GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License

run Iowa sample on Beagle #427

Closed sebhtml closed 10 years ago

sebhtml commented 10 years ago

/lustre/beagle/CompBIO/biosal-THOR

biosal-405-256-nodes-13 is queued (2775124)

sebhtml commented 10 years ago

dependency: #426

sebhtml commented 10 years ago

biosal-405-256-nodes-13.e2775124:

[NID 00636] 2014-07-22 14:15:45 Apid 4747407: initiated application termination

[NID 00636] 2014-07-22 14:15:48 Apid 4747407: OOM killer terminated this process.

sebhtml commented 10 years ago

Beagle) qsub biosal-405-256-nodes-14.pbs

2775697.sdb

sebhtml commented 10 years ago

Previous job (https://anl.app.box.com/files/0/f/2209403957/1/f_19075724265):

Expected:

Beagle) sha1sum coverage_distribution.txt-canonical

01a293db48518190038eaddbaed8a47ca0323fc7  coverage_distribution.txt-canonical

Actual:

Beagle) sha1sum coverage_distribution.txt-canonical

01a293db48518190038eaddbaed8a47ca0323fc7  coverage_distribution.txt-canonical
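
The expected/actual comparison above was done by eye on `sha1sum` output. A minimal Python sketch of the same check, assuming the result file is readable from where the script runs; `sha1_of_file` and `matches_expected` are hypothetical helper names, not part of biosal:

```python
import hashlib

# Checksum reported for both the previous job and this one.
EXPECTED_SHA1 = "01a293db48518190038eaddbaed8a47ca0323fc7"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex=EXPECTED_SHA1):
    """True when the file's SHA-1 equals the expected hex digest."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_expected("coverage_distribution.txt-canonical")` after a run would flag a changed result without comparing 40-character strings manually.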

sebhtml commented 10 years ago

But the job segfaulted at the end:

Segmentation fault

bsal_tracer_print_stack_backtrace TRACE IS NOT AVAILABLE.

sebhtml commented 10 years ago

rdi 0x0 0

sebhtml commented 10 years ago

With GNU toolchain (to get a backtrace):

Beagle) qsub biosal-405-256-nodes-15.pbs

2775909.sdb

sebhtml commented 10 years ago

addr2line is broken on the Cray...

Beagle) objdump -d argonnite > biosal-405-256-nodes-15.s

Let's do it with disassemble-and-get-stack.py

sebhtml commented 10 years ago

Beagle) biosal/scripts/disassemble-and-get-stack.py -e argonnite < biosal-405-256-nodes-15.stack

0 0x413bb4 bsal_tracer_print_stack_backtrace> 413bb4: 48 89 e7 mov %rsp,%rdi

1 0x404f14 bsal_node_handle_signal> 404f14: 48 8b 3d 35 97 47 00 mov 0x479735(%rip),%rdi # 87e650 <_IO_stdout>

2 0x54dae0 __restore_rt> 54dae0: 48 c7 c0 0f 00 00 00 mov $0xf,%rax

3 0x408900 bsal_actor_script> 408900: 48 8b 7f 08 mov 0x8(%rdi),%rdi

4 0x40c89b bsal_worker_pool_give_message_to_actor> 40c89b: 48 89 df mov %rbx,%rdi

5 0x40c98d bsal_worker_pool_work> 40c98d: e9 70 ff ff ff jmpq 40c902 <bsal_worker_pool_work+0x22>

6 0x4070f6 bsal_node_run_loop> 4070f6: 83 fd 00 cmp $0x0,%ebp

7 0x407439 bsal_node_run> 407439: 8b 83 70 10 00 00 mov 0x1070(%rbx),%eax

8 0x400d9e main> 400d9e: 48 8d 7c 24 10 lea 0x10(%rsp),%rdi

9 0x554954 __libc_start_main> 554954: 89 c7 mov %eax,%edi

10 0x400ddd _start> 400ddd: f4 hlt
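
With addr2line unusable here, function names can be recovered by looking each address up in the `objdump -d` listing. A rough Python sketch of that lookup follows. It is not the actual disassemble-and-get-stack.py script, whose internals are not shown in this thread. It assumes GNU objdump's `ADDRESS <symbol>:` header lines and attributes an address to the nearest preceding symbol start, which is only an approximation because symbol sizes are ignored. `load_symbols` and `resolve` are hypothetical names.

```python
import bisect
import re

# Matches objdump symbol headers such as "0000000000400d80 <main>:".
SYMBOL_HEADER = re.compile(r"^([0-9a-fA-F]+) <([^>]+)>:\s*$")


def load_symbols(disassembly_lines):
    """Collect (start_address, name) pairs from objdump -d output lines."""
    symbols = []
    for line in disassembly_lines:
        match = SYMBOL_HEADER.match(line)
        if match:
            symbols.append((int(match.group(1), 16), match.group(2)))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return the name of the last symbol starting at or before address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies before the first known symbol
    return symbols[index][1]
```

A typical use would be `symbols = load_symbols(open("biosal-405-256-nodes-15.s"))` followed by `resolve(symbols, 0x408900)` for each frame address.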

sebhtml commented 10 years ago

biosal-405-256-nodes-15.o2775909 walltime=00:15:15

Stack backtrace has 11 frames

0 [0x413bb4]

1 [0x404f14]

2 [0x54dae0]

3 [0x408900]

4 [0x40c89b]

5 [0x40c98d]

6 [0x4070f6]

7 [0x407439]

8 [0x400d9e]

9 [0x554954]

10 [0x400ddd]
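
The frame list above can be turned into one address per line, which is the form addr2line reads from standard input. A small Python sketch, assuming every frame line looks like `N [0xADDR]` as in this log; `extract_frames` is a hypothetical helper name:

```python
import re

# Matches frame lines such as "3 [0x408900]", optionally followed by more text.
FRAME_LINE = re.compile(r"^\s*(\d+)\s+\[(0x[0-9a-fA-F]+)\]")


def extract_frames(log_lines):
    """Return (frame_number, address_string) pairs found in the log lines."""
    frames = []
    for line in log_lines:
        match = FRAME_LINE.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def to_address_list(frames):
    """Format frames as one address per line, in frame order."""
    return "\n".join(address for _, address in sorted(frames)) + "\n"
```

Lines that are not frames, such as "Stack backtrace has 11 frames", are skipped.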

sebhtml commented 10 years ago

addr2line is broken on Beagle:

Beagle) addr2line -e argonnite < biosal-405-256-nodes-15.stack

??:0

??:0

sigaction.c:0

??:0

??:0

??:0

??:0

??:0

/lustre/beagle/CompBIO/biosal-THOR/biosal/applications/argonnite_kmer_counter/main.c:14

/usr/src/packages/BUILD/glibc-2.11.3/csu/libc-start.c:226

/usr/src/packages/BUILD/glibc-2.11.3/csu/../sysdeps/x86_64/elf/start.S:116

sebhtml commented 10 years ago

_pmiu_daemon(SIGCHLD): [NID 00310] [c6-0c2s4n0] [Wed Jul 23 15:08:04 2014] PE RANK 255 exit signal Segmentation fault

[NID 00310] 2014-07-23 15:08:04 Apid 4749597: initiated application termination

sebhtml commented 10 years ago

Beagle) qsub biosal-405-256-nodes-16.pbs

2776076.sdb

sebhtml commented 10 years ago

Error, node/0 received signal SIGSEGV

bsal_tracer_print_stack_backtrace

Stack backtrace has 10 frames

0 [0x413c64]

1 [0x404f14]

2 [0x54dba0]

3 [0x4027ca]

4 [0x40a4a6]

5 [0x40a5ce]

6 [0x40b9cf]

7 [0x40bcea]

8 [0x549c06]

9 [0x5930a9]

0 [0x413c64] bsal_tracer_print_stack_backtrace> 413c64: 48 89 e7 mov %rsp,%rdi

1 [0x404f14] bsal_node_handle_signal> 404f14: 48 8b 3d 35 97 47 00 mov 0x479735(%rip),%rdi # 87e650 <_IO_stdout>

2 [0x54dba0] __restore_rt> 54dba0: 48 c7 c0 0f 00 00 00 mov $0xf,%rax

3 [0x4027ca] argonnite_receive> 4027ca: 83 00 01 addl $0x1,(%rax)

4 [0x40a4a6] bsal_actor_receive> 40a4a6: eb 87 jmp 40a42f <bsal_actor_receive+0x4f>

5 [0x40a5ce] bsal_actor_work> 40a5ce: 48 89 ee mov %rbp,%rsi

6 [0x40b9cf] bsal_worker_work> 40b9cf: 31 f6 xor %esi,%esi

7 [0x40bcea] bsal_worker_main> 40bcea: 4c 89 ef mov %r13,%rdi

8 [0x549c06] start_thread> 549c06: 64 48 89 04 25 30 06 mov %rax,%fs:0x630

9 [0x5930a9] __clone> 5930a9: 48 89 c7 mov %rax,%rdi

sebhtml commented 10 years ago

4027bd: 4c 89 ef mov %r13,%rdi

4027c0: e8 1b 6d 01 00 callq 4194e0

4027c5: 48 89 44 24 38 mov %rax,0x38(%rsp)

4027ca: 83 00 01 addl $0x1,(%rax)

4027cd: 4c 89 f7 mov %r14,%rdi

sebhtml commented 10 years ago

Beagle) qsub biosal-405-256-nodes-17.pbs

2776121.sdb

sebhtml commented 10 years ago

Beagle) qsub biosal-405-256-nodes-18.pbs

2776124.sdb

sebhtml commented 10 years ago

Beagle) qsub biosal-405-256-nodes-20.pbs

2777421.sdb

Beagle) showq | grep sebh

2777421 sebhtml Running 6144 00:59:56 Fri Jul 25 03:37:28

sebhtml commented 10 years ago

without counters:

Beagle) qsub biosal-405-256-nodes-21.pbs

2777422.sdb

sebhtml commented 10 years ago

action points:

check iterations 20 and 21

sebhtml commented 10 years ago

biosal-405-256-nodes-20.o2777421 (with counters): walltime=00:21:34

biosal-405-256-nodes-21.o2777422 (without counters): walltime=00:21:48

01a293db48518190038eaddbaed8a47ca0323fc7  coverage_distribution.txt-canonical

Beagle) grep efficiency biosal-405-256-nodes-21.stdout | tail

node 16 efficiency: 0.18

node 28 efficiency: 0.18

node 14 efficiency: 0.19

node 32 efficiency: 0.18

node 12 efficiency: 0.17

node 10 efficiency: 0.18

node 17 efficiency: 0.19

node 30 efficiency: 0.18

node 29 efficiency: 0.18

node 19 efficiency: 0.19

TIMER 4 minutes, 51.190277 seconds

TIMER 3 minutes, 57.567780 seconds

TIMER 8 minutes, 48.758057 seconds
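
To summarize these numbers without reading them off by hand, a short Python sketch can parse the efficiency and TIMER lines. It assumes the two line formats exactly as printed above (minutes and seconds only, no hours). What each TIMER measures is not stated in this thread; the sketch only makes it easy to check that the third value equals the sum of the first two. `parse_efficiencies` and `parse_timers` are hypothetical helper names.

```python
import re

# Line formats assumed from the output above.
EFFICIENCY = re.compile(r"node (\d+) efficiency: (\d+(?:\.\d+)?)")
TIMER = re.compile(r"TIMER (\d+) minutes, (\d+(?:\.\d+)?) seconds")


def parse_efficiencies(text):
    """Map node number to its reported efficiency."""
    return {int(node): float(value) for node, value in EFFICIENCY.findall(text)}


def parse_timers(text):
    """Return each TIMER entry converted to seconds, in order of appearance."""
    return [int(minutes) * 60 + float(seconds)
            for minutes, seconds in TIMER.findall(text)]


def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)
```

On the ten nodes shown, `mean(parse_efficiencies(text).values())` comes out at about 0.18, so the workers are reported busy less than a fifth of the time.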