Closed levinas closed 10 years ago
613cad65526d6b92507836777f489bf7fa0772fd
distribution 18665021 wrote coverage_distribution.txt
distribution 18665021 wrote coverage_distribution.txt-canonical
argonnite 1525321102 stops argonnite 1525321102
argonnite 1525321102 stops argonnite 56468085
argonnite 1525321102 stops
manager 1960383200 dies
kernel/1466195678 generated 11755520 kmers from 167936 entries (13995 blocks)
manager 1830730709 dies
kernel/1631472591 generated 11509120 kmers from 164416 entries (13702 blocks)
argonnite 56468085 stops
actor/56468085: permission denied, will not stop
Error, node/1 received signal SIGSEGV
bsal_tracer_print_stack_backtrace
Stack backtrace has 14 frames
#0 0 argonnite 0x000000010d8203d3 bsal_tracer_print_stack_backtrace + 67
#1 1 argonnite 0x000000010d815171 bsal_node_handle_signal + 113
#2 2 libsystem_platform.dylib 0x00007fff931775aa _sigtramp + 26
#3 3 ??? 0x0000000000000000 0x0 + 0
#4 4 argonnite 0x000000010d8241dc bsal_vector_push_back + 220
#5 5 argonnite 0x000000010d82adfd bsal_coverage_distribution_receive + 253
#6 6 argonnite 0x000000010d817105 bsal_actor_receive + 245
#7 7 argonnite 0x000000010d817988 bsal_actor_work + 72
#8 8 argonnite 0x000000010d81908c bsal_worker_work + 140
#9 9 argonnite 0x000000010d8180ca bsal_worker_run + 266
#10 10 argonnite 0x000000010d817fa8 bsal_worker_main + 24
#11 11 libsystem_pthread.dylib 0x00007fff93b59899 _pthread_body + 138
#12 12 libsystem_pthread.dylib 0x00007fff93b5972a _pthread_struct_init + 0
#13 13 libsystem_pthread.dylib 0x00007fff93b5dfc9 thread_start + 13
manager 969881812 dies
kernel/123260028 generated 11755520 kmers from 167936 entries (13995 blocks)
manager 1268070524 dies
kernel/895259492 generated 11755520 kmers from 167936 entries (13995 blocks)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 64698 RUNNING AT napoli
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
can you implement the memory fence on the mac ?
it might be related.
Yes. I will do that.
[boisvert@bigmem biosal]$ git remote add fangfang https://github.com/levinas/biosal.git [boisvert@bigmem biosal]$ git fetch fangfang remote: Counting objects: 6, done. remote: Compressing objects: 100% (6/6), done. remote: Total 6 (delta 0), reused 1 (delta 0) Unpacking objects: 100% (6/6), done. From https://github.com/levinas/biosal
Review:
You can use scripts/check.sh to check for whitespace:
Detected issues: 1 core/system/timer.c:177:
Also, if application, you should include these at the end of your commit:
Link: https://github.com/GeneAssembly/biosal/issues/448 Signed-off-by: ...
The signed off by line can be added automatically with "git commit -as"
My mac is x86 and __GNUC__
is defined.
So it's using the fence then. Is it reproducible ?
Yes, it's reproducible.
Interesting. I am not seeing that on my home computer, on my laptop, on mira, on cetus, on bigmem.knowville, on sal, and on beagle.
Can you check what's the assembly code for the fence code ?
bsal_coverage_distribution_receive calls vector_push_back at line 141.
I guess there is a bug there.
Can you run it in gdb and check if the line is really 141 ?
This is the simplest run I can reproduce it (1 thread, mini.fastq, O1). If I use GDB or O0, the program exits normally.
$ ../applications/argonnite -k 31 mini.fastq
...
DEBUG bsal_vector_push_back new_size is 43, pushed in bucket 42, value in bucket 11
DEBUG bsal_vector_push_back size 43 max 64
DEBUG bsal_vector_push_back new_size is 44, pushed in bucket 43, value in bucket 44
DEBUG bsal_vector_push_back size 44 max 64
DEBUG bsal_vector_push_back new_size is 45, pushed in bucket 44, value in bucket 34
distribution 1426756063 wrote coverage_distribution.txt
distribution 1426756063 wrote coverage_distribution.txt-canonical
DEBUG bsal_vector_push_back size 0 max 0
DEBUG bsal_vector_reserve 4 buckets current_size 0
DEBUG bsal_vector_reserve old_byte_count 0 new_byte_count 16
DEBUG size 0 old 0x0 new 0x7f9411f00010
DEBUG bsal_vector_push_back size 0 max 4
DEBUG bsal_vector_push_back new_size is 1, pushed in bucket 0, value in bucket 586771433
argonnite 586771433 stops argonnite 586771433
argonnite 586771433 stops
manager 1967663920 dies
>> bsal_coverage_distribution_receive:: tag = 24699, BSAL_ACTOR_ASK_TO_STOP = 24699
DEBUG bsal_vector_push_back size 0 max 0
DEBUG bsal_vector_reserve 4 buckets current_size 0
DEBUG bsal_vector_reserve old_byte_count 0 new_byte_count 16
DEBUG size 0 old 0x0 new 0x7f9411f00010
DEBUG bsal_vector_push_back size 0 max 4
Error, node/0 received signal SIGSEGV
bsal_tracer_print_stack_backtrace
Stack backtrace has 15 frames
#0 0 argonnite 0x000000010feca193 bsal_tracer_print_stack_backtrace + 67
#1 1 argonnite 0x000000010febef31 bsal_node_handle_signal + 113
#2 2 libsystem_platform.dylib 0x00007fff8ffa15aa _sigtramp + 26
#3 3 ??? 0x657a6973206b6361 0x0 + 7312272888175485793
#4 4 argonnite 0x000000010fecdf94 bsal_vector_push_back + 180
#5 5 argonnite 0x000000010fed4c0d bsal_coverage_distribution_receive + 269
#6 6 argonnite 0x000000010fec0ec5 bsal_actor_receive + 245
#7 7 argonnite 0x000000010fec1748 bsal_actor_work + 72
#8 8 argonnite 0x000000010fec2e4c bsal_worker_work + 140
#9 9 argonnite 0x000000010fec1e8a bsal_worker_run + 266
#10 10 argonnite 0x000000010febdfbb bsal_node_run_loop + 475
#11 11 argonnite 0x000000010febdcc3 bsal_node_run + 163
#12 12 argonnite 0x000000010feba5c3 main + 99
#13 13 libdyld.dylib 0x00007fff925d15fd start + 1
#14 14 ??? 0x0000000000000004 0x0 + 4
segmentation fault
Does this offer any clue?
Was the vector somehow destroyed too soon?
Does this offer any clue?
This is debug information from vector:
DEBUG bsal_vector_push_back size 0 max 0 DEBUG bsal_vector_reserve 4 buckets current_size 0 DEBUG bsal_vector_reserve old_byte_count 0 new_byte_count 16 DEBUG size 0 old 0x0 new 0x7f9411f00010 DEBUG bsal_vector_push_back size 0 max 4
What happens:
vector is initialized:
https://github.com/sebhtml/biosal/blob/master/genomics/data/coverage_distribution.c#L134
first item is pushed into the vector:
https://github.com/sebhtml/biosal/blob/master/genomics/data/coverage_distribution.c#L141
This triggers the "resize and push_back" code path:
https://github.com/sebhtml/biosal/blob/master/core/structures/vector.c#L95
Was the vector somehow destroyed too soon?
I don't think so.
It is destroyed at this line:
https://github.com/sebhtml/biosal/blob/master/genomics/data/coverage_distribution.c#L148
Which compiler are you using ?
It is not some fancy race condition since "../applications/argonnite -k 31 mini.fastq" runs everything in the main thread.
If the bug does not appear in gdb and does not appear with -O0, maybe that some strange interaction with the compiler. For example, Linus Torvalds ran into a ugly bug in gcc recently which was triggered only with very specific arguments.
see http://www.reddit.com/r/linux/comments/2br67o/gcc_49_is_doing_some_seriously_crazy_shit/
Can you run the unit tests on your Apple computer ("make tests"). There is a unit test for vector.
vector and map are used everywhere in biosal, thorium, argonnite, and basically everywhere. So this just looks like an aggressive dispatching of instructions in the x86 out-of-order pipeline or some compiler oddity.
I saw those before by the way. To fix these you just need sometimes to change the code a bit so that the compiler change its code path.
In this case, you can remove the vector things in the tag handler for ASK_TO_STOP since it is not being used (it used to be).
Action bullets:
@levinas
Can you generate a stack trace with line numbers ?
To do that, you will need 3 things on your Apple computer:
Once you have these 3 things, do this in your terminal:
[seb@localhost ~]$ cat stack-frames.txt 0x000000010feca193 0x000000010febef31 0x00007fff8ffa15aa 0x657a6973206b6361 0x000000010fecdf94 0x000000010fed4c0d 0x000000010fec0ec5 0x000000010fec1748 0x000000010fec2e4c 0x000000010fec1e8a 0x000000010febdfbb 0x000000010febdcc3 0x000000010feba5c3 0x00007fff925d15fd 0x0000000000000004 [seb@localhost ~]$ addr2line -e argonnite < stack-frames.txt (You'll get a stack trace here)
Once you know the line number, you will likely figure out something. That's how I usually do it.
Let me know how it goes.
mpicc -v
mpicc for MPICH2 version 1.4.1p1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.1.0
Thread model: posix
clang: warning: argument unused during compilation: '-I /usr/local/include'
kernel/1255914200 generated 1750000 kmers from 25000 entries (12500 blocks)
<Thorium> node 0 efficiency: 0.00
Process 31736 stopped
* thread #1: tid = 0x1ea79d8, 0x00007fff91873866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread, stop reason = signal SIGABRT
frame #0: 0x00007fff91873866 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill + 10:
-> 0x7fff91873866: jae 0x7fff91873870 ; __pthread_kill + 20
0x7fff91873868: movq %rax, %rdi
0x7fff9187386b: jmpq 0x7fff91870175 ; cerror_nocancel
0x7fff91873870: ret
.
see: #482 test with clang 3.3 and mpich 1.4.1
@levinas this ticket should be closed and a new one should be opened to report the problem on the Mac because you ported clock_get_time already.
Ok. Move to #483.
for timer nanoseconds support
http://stackoverflow.com/questions/5167269/clock-gettime-alternative-in-mac-os-x