GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License
6 stars 1 forks source link

use clock_get_time on apple #448

Closed levinas closed 10 years ago

levinas commented 10 years ago

for timer nanoseconds support

http://stackoverflow.com/questions/5167269/clock-gettime-alternative-in-mac-os-x

levinas commented 10 years ago

613cad65526d6b92507836777f489bf7fa0772fd

sebhtml commented 10 years ago

https://github.com/levinas/biosal/commit/613cad65526d6b92507836777f489bf7fa0772fd

levinas commented 10 years ago
distribution 18665021 wrote coverage_distribution.txt
distribution 18665021 wrote coverage_distribution.txt-canonical
argonnite 1525321102 stops argonnite 1525321102
argonnite 1525321102 stops argonnite 56468085
argonnite 1525321102 stops
manager 1960383200 dies
kernel/1466195678 generated 11755520 kmers from 167936 entries (13995 blocks)
manager 1830730709 dies
kernel/1631472591 generated 11509120 kmers from 164416 entries (13702 blocks)
argonnite 56468085 stops
actor/56468085: permission denied, will not stop
Error, node/1 received signal SIGSEGV
bsal_tracer_print_stack_backtrace
Stack backtrace has 14 frames
#0  0   argonnite                           0x000000010d8203d3 bsal_tracer_print_stack_backtrace + 67
#1  1   argonnite                           0x000000010d815171 bsal_node_handle_signal + 113
#2  2   libsystem_platform.dylib            0x00007fff931775aa _sigtramp + 26
#3  3   ???                                 0x0000000000000000 0x0 + 0
#4  4   argonnite                           0x000000010d8241dc bsal_vector_push_back + 220
#5  5   argonnite                           0x000000010d82adfd bsal_coverage_distribution_receive + 253
#6  6   argonnite                           0x000000010d817105 bsal_actor_receive + 245
#7  7   argonnite                           0x000000010d817988 bsal_actor_work + 72
#8  8   argonnite                           0x000000010d81908c bsal_worker_work + 140
#9  9   argonnite                           0x000000010d8180ca bsal_worker_run + 266
#10  10  argonnite                           0x000000010d817fa8 bsal_worker_main + 24
#11  11  libsystem_pthread.dylib             0x00007fff93b59899 _pthread_body + 138
#12  12  libsystem_pthread.dylib             0x00007fff93b5972a _pthread_struct_init + 0
#13  13  libsystem_pthread.dylib             0x00007fff93b5dfc9 thread_start + 13
manager 969881812 dies
kernel/123260028 generated 11755520 kmers from 167936 entries (13995 blocks)
manager 1268070524 dies
kernel/895259492 generated 11755520 kmers from 167936 entries (13995 blocks)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 64698 RUNNING AT napoli
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
sebhtml commented 10 years ago

can you implement the memory fence on the mac ?

it might be related.

levinas commented 10 years ago

Yes. I will do that.

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ git remote add fangfang https://github.com/levinas/biosal.git [boisvert@bigmem biosal]$ git fetch fangfang remote: Counting objects: 6, done. remote: Compressing objects: 100% (6/6), done. remote: Total 6 (delta 0), reused 1 (delta 0) Unpacking objects: 100% (6/6), done. From https://github.com/levinas/biosal

sebhtml commented 10 years ago

Review:

sebhtml commented 10 years ago

You can use scripts/check.sh to check for whitespace:

Detected issues: 1 core/system/timer.c:177:

sebhtml commented 10 years ago

Also, if application, you should include these at the end of your commit:

Link: https://github.com/GeneAssembly/biosal/issues/448 Signed-off-by: ...

The signed off by line can be added automatically with "git commit -as"

levinas commented 10 years ago

My mac is x86 and __GNUC__ is defined.

sebhtml commented 10 years ago

So it's using the fence then. Is it reproducible ?

levinas commented 10 years ago

Yes, it's reproducible.

sebhtml commented 10 years ago

Interesting. I am not seeing that on my home computer, on my laptop, on mira, on cetus, on bigmem.knowville, on sal, and on beagle.

Can you check what's the assembly code for the fence code ?

sebhtml commented 10 years ago

bsal_coverage_distribution_receive calls vector_push_back at line 141.

I guess there is a bug there.

Can you run it in gdb and check if the line is really 141 ?

levinas commented 10 years ago

This is the simplest run I can reproduce it (1 thread, mini.fastq, O1). If I use GDB or O0, the program exits normally.

$ ../applications/argonnite -k 31 mini.fastq

...
DEBUG bsal_vector_push_back new_size is 43, pushed in bucket 42, value in bucket 11
DEBUG bsal_vector_push_back size 43 max 64
DEBUG bsal_vector_push_back new_size is 44, pushed in bucket 43, value in bucket 44
DEBUG bsal_vector_push_back size 44 max 64
DEBUG bsal_vector_push_back new_size is 45, pushed in bucket 44, value in bucket 34
distribution 1426756063 wrote coverage_distribution.txt
distribution 1426756063 wrote coverage_distribution.txt-canonical
DEBUG bsal_vector_push_back size 0 max 0
DEBUG bsal_vector_reserve 4 buckets current_size 0
DEBUG bsal_vector_reserve old_byte_count 0 new_byte_count 16
DEBUG size 0 old 0x0 new 0x7f9411f00010
DEBUG bsal_vector_push_back size 0 max 4
DEBUG bsal_vector_push_back new_size is 1, pushed in bucket 0, value in bucket 586771433
argonnite 586771433 stops argonnite 586771433
argonnite 586771433 stops
manager 1967663920 dies
>> bsal_coverage_distribution_receive:: tag = 24699, BSAL_ACTOR_ASK_TO_STOP = 24699
DEBUG bsal_vector_push_back size 0 max 0
DEBUG bsal_vector_reserve 4 buckets current_size 0
DEBUG bsal_vector_reserve old_byte_count 0 new_byte_count 16
DEBUG size 0 old 0x0 new 0x7f9411f00010
DEBUG bsal_vector_push_back size 0 max 4
Error, node/0 received signal SIGSEGV
bsal_tracer_print_stack_backtrace
Stack backtrace has 15 frames
#0  0   argonnite                           0x000000010feca193 bsal_tracer_print_stack_backtrace + 67
#1  1   argonnite                           0x000000010febef31 bsal_node_handle_signal + 113
#2  2   libsystem_platform.dylib            0x00007fff8ffa15aa _sigtramp + 26
#3  3   ???                                 0x657a6973206b6361 0x0 + 7312272888175485793
#4  4   argonnite                           0x000000010fecdf94 bsal_vector_push_back + 180
#5  5   argonnite                           0x000000010fed4c0d bsal_coverage_distribution_receive + 269
#6  6   argonnite                           0x000000010fec0ec5 bsal_actor_receive + 245
#7  7   argonnite                           0x000000010fec1748 bsal_actor_work + 72
#8  8   argonnite                           0x000000010fec2e4c bsal_worker_work + 140
#9  9   argonnite                           0x000000010fec1e8a bsal_worker_run + 266
#10  10  argonnite                           0x000000010febdfbb bsal_node_run_loop + 475
#11  11  argonnite                           0x000000010febdcc3 bsal_node_run + 163
#12  12  argonnite                           0x000000010feba5c3 main + 99
#13  13  libdyld.dylib                       0x00007fff925d15fd start + 1
#14  14  ???                                 0x0000000000000004 0x0 + 4
segmentation fault

Does this offer any clue?

Was the vector somehow destroyed too soon?

sebhtml commented 10 years ago

Does this offer any clue?

This is debug information from vector:

DEBUG bsal_vector_push_back size 0 max 0 DEBUG bsal_vector_reserve 4 buckets current_size 0 DEBUG bsal_vector_reserve old_byte_count 0 new_byte_count 16 DEBUG size 0 old 0x0 new 0x7f9411f00010 DEBUG bsal_vector_push_back size 0 max 4

What happens:

vector is initialized:

https://github.com/sebhtml/biosal/blob/master/genomics/data/coverage_distribution.c#L134

first item is pushed into the vector:

https://github.com/sebhtml/biosal/blob/master/genomics/data/coverage_distribution.c#L141

This triggers the "resize and push_back" code path:

https://github.com/sebhtml/biosal/blob/master/core/structures/vector.c#L95

Was the vector somehow destroyed too soon?

I don't think so.

It is destroyed at this line:

https://github.com/sebhtml/biosal/blob/master/genomics/data/coverage_distribution.c#L148

Which compiler are you using ?

It is not some fancy race condition since "../applications/argonnite -k 31 mini.fastq" runs everything in the main thread.

If the bug does not appear in gdb and does not appear with -O0, maybe that some strange interaction with the compiler. For example, Linus Torvalds ran into a ugly bug in gcc recently which was triggered only with very specific arguments.

see http://www.reddit.com/r/linux/comments/2br67o/gcc_49_is_doing_some_seriously_crazy_shit/

Can you run the unit tests on your Apple computer ("make tests"). There is a unit test for vector.

vector and map are used everywhere in biosal, thorium, argonnite, and basically everywhere. So this just looks like an aggressive dispatching of instructions in the x86 out-of-order pipeline or some compiler oddity.

I saw those before by the way. To fix these you just need sometimes to change the code a bit so that the compiler change its code path.

In this case, you can remove the vector things in the tag handler for ASK_TO_STOP since it is not being used (it used to be).

Action bullets:

sebhtml commented 10 years ago

@levinas

Can you generate a stack trace with line numbers ?

To do that, you will need 3 things on your Apple computer:

Once you have these 3 things, do this in your terminal:

[seb@localhost ~]$ cat stack-frames.txt 0x000000010feca193 0x000000010febef31 0x00007fff8ffa15aa 0x657a6973206b6361 0x000000010fecdf94 0x000000010fed4c0d 0x000000010fec0ec5 0x000000010fec1748 0x000000010fec2e4c 0x000000010fec1e8a 0x000000010febdfbb 0x000000010febdcc3 0x000000010feba5c3 0x00007fff925d15fd 0x0000000000000004 [seb@localhost ~]$ addr2line -e argonnite < stack-frames.txt (You'll get a stack trace here)

Once you know the line number, you will likely figure out something. That's how I usually do it.

Let me know how it goes.

levinas commented 10 years ago
mpicc -v

mpicc for MPICH2 version 1.4.1p1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.1.0
Thread model: posix
clang: warning: argument unused during compilation: '-I /usr/local/include'
levinas commented 10 years ago
kernel/1255914200 generated 1750000 kmers from 25000 entries (12500 blocks)
<Thorium> node 0 efficiency: 0.00
Process 31736 stopped
* thread #1: tid = 0x1ea79d8, 0x00007fff91873866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread, stop reason = signal SIGABRT
    frame #0: 0x00007fff91873866 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill + 10:
-> 0x7fff91873866:  jae    0x7fff91873870            ; __pthread_kill + 20
   0x7fff91873868:  movq   %rax, %rdi
   0x7fff9187386b:  jmpq   0x7fff91870175            ; cerror_nocancel
   0x7fff91873870:  ret
sebhtml commented 10 years ago

.

sebhtml commented 10 years ago

see: #482 test with clang 3.3 and mpich 1.4.1

sebhtml commented 10 years ago

@levinas this ticket should be closed and a new one should be opened to report the problem on the Mac because you ported clock_get_time already.

levinas commented 10 years ago

Ok. Move to #483.