Closed by sebhtml 10 years ago
ran out of memory; reduce the number of concurrent active messages.
AUTO-SCALING kernel 228292686 receives auto-scale message (BSAL_ACTOR_DO_AUTO_SCALING) via actor 228292686
kernel 364932174 is online !!!
DEBUG Error bsal_memory_allocate returned (nil), 132691252 bytes
bsal_tracer_print_stack_backtrace
Stack backtrace has 11 frames
DEBUG Error bsal_memory_allocate returned (nil), 132691252 bytes
bsal_tracer_print_stack_backtrace
dependency: #426
[boisvert@miralac1 biosal-tests]$ ./biosal-Iowa-24.sh \
Project 'compbio'; job rerouted to queue 'prod-short'
300743
auto-scaling is still enabled...
[boisvert@miralac1 biosal-tests]$ grep AUTO biosal-Iowa-24.output|grep enables | grep node|wc -l
1885
[boisvert@miralac1 biosal-tests]$ grep 111740947 biosal-Iowa-24.output | head
kernel 111740947 is online !!!
kernel 111740947 processed 25998 entries (1 blocks) so far
AUTO-SCALING kernel 111740947 enables auto-scaling (BSAL_ACTOR_ENABLE_AUTO_SCALING) via actor 77834752
AUTO-SCALING node/19 enables auto-scaling for actor 111740947 (BSAL_ACTOR_ENABLE_AUTO_SCALING)
nevermind, the limit is 0
the timer code in thorium is broken for blue gene
log file:
biosal-Iowa-24.output
[boisvert@miralac1 biosal-tests]$ addr2line -e argonnite < biosal-Iowa-24.output.stack
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/tracer.c:36
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/memory.c:97
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/memory_block.c:30
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/memory_pool.c:167
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/memory_pool.c:80
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/genomics/kernels/aggregator.c:172
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/dispatcher.c:75
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/actor.c:1243
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/actor.c:1827
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/worker.c:1148
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/worker.c:246
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322
:0
??:0
DEBUG Error bsal_memory_allocate returned (nil), 8388608 bytes
[boisvert@miralac1 biosal-tests]$ grep "kmer store" biosal-Iowa-24.output |grep coverage|wc -l
15872
[boisvert@miralac1 biosal-tests]$ echo $((512 * 31))
15872
memory usage is not uniform...
completion is not uniform:
[boisvert@miralac1 biosal-tests]$ grep left biosal-Iowa-24.output | tail -n 15872|awk '{print $6}'|sort | uniq -c
474 (0.63)
42 (0.64)
1183 (0.81)
853 (0.82)
[boisvert@miralac1 biosal-tests]$ ./biosal-Iowa-25.sh \
Project 'compbio'; job rerouted to queue 'prod-short'
301012
Almost there:
sequence store 372103113 has 30234/286720 (0.11) entries left to produce
sequence store 1192878683 has 32168/290816 (0.11) entries left to produce
sequence store 1377052464 has 32168/290816 (0.11) entries left to produce
sequence store 1875153060 has 11900/290816 (0.04) entries left to produce
sequence store 1235229640 has 30234/286720 (0.11) entries left to produce
DEBUG Error bsal_memory_allocate returned (nil), 32104524 bytes
[boisvert@miralac1 biosal-tests]$ addr2line -e argonnite < biosal-Iowa-25.stack
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/tracer.c:40
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/memory.c:97
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/memory_pool.c:121
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/core/system/memory_pool.c:80
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/genomics/kernels/dna_kmer_counter_kernel.c:281
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/actor.c:899
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/actor.c:1829
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/worker.c:1148
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests/biosal/engine/thorium/worker.c:246
/bgsys/drivers/V1R2M1/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322
:0
??:0
[boisvert@miralac1 biosal-tests]$ grep MEMORY biosal-Iowa-25.output|tail -n 512|awk '{print $6}'|sort -r -n|head
16441671680
9093050368
9090310144
9083207680
9080508416
9079181312
9073364992
9045626880
9034493952
9030164480
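One value in the memory listing (16441671680 bytes) is far above the other nodes, which all sit near 9 GB. A minimal sketch for flagging such values automatically, assuming the input is one byte count per line as produced by the `awk '{print $6}'` step in the command above. The function name `flag_memory_outliers` and the 1.5x-median threshold are illustrative assumptions, not part of biosal.

```shell
#!/bin/bash
# Hypothetical helper: read one byte count per line on stdin and print
# every value larger than 1.5 times the median (threshold is an assumption).
flag_memory_outliers() {
    sort -n | awk '
        { v[NR] = $1 }                       # values arrive sorted ascending
        END {
            if (NR == 0) exit
            median = v[int((NR + 1) / 2)]    # lower median for even counts
            for (i = 1; i <= NR; i++)
                if (v[i] > 1.5 * median) print v[i]
        }'
}
```

Possible usage: `grep MEMORY biosal-Iowa-25.output | tail -n 512 | awk '{print $6}' | flag_memory_outliers`.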
somehow, node 260 had a problem (?):
[boisvert@miralac1 biosal-tests]$ grep "node/260" biosal-Iowa-25.output|grep MEMORY|tail
loads:
node 260 is receiving too many messages.
probably some sort of strange kmer (NNNNNNNNNNNNN)
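One way to check the strange-kmer guess is to count how many reads contain a run of at least k consecutive N characters. This is only a sketch: it assumes plain 4-line FASTQ records (sequence on the second line of each record) and uppercase N, and the function name `count_reads_with_n_run` is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: count FASTQ reads containing k or more consecutive Ns.
# Usage: count_reads_with_n_run <k> [file.fastq ...]   (reads stdin if no files)
# Assumes 4-line FASTQ records; with several files, each file must hold
# whole records so that the line-number test stays aligned.
count_reads_with_n_run() {
    local k="$1"
    shift
    awk -v k="$k" '
        BEGIN { for (i = 0; i < k; i++) pat = pat "N" }   # build "NNN...N"
        NR % 4 == 2 && index($0, pat) > 0 { n++ }         # sequence lines only
        END { print n + 0 }' "$@"
}
```

Possible usage with the k value from the job script: `count_reads_with_n_run 43 Iowa_Continuous_Corn/*.fastq`. A large count would be consistent with the guess, but would not prove that those k-mers all land on node 260.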
Thorium counters:
[boisvert@cetuslac1 biosal-tests]$ grep BSAL_COUNTER_BALANCE_MESSAGES biosal-Iowa-cetus-1.output | grep ^"259 " | tail
259 balance BSAL_COUNTER_BALANCE_MESSAGES 16532
259 balance BSAL_COUNTER_BALANCE_MESSAGES 14020
259 balance BSAL_COUNTER_BALANCE_MESSAGES -541
259 balance BSAL_COUNTER_BALANCE_MESSAGES 6844
259 balance BSAL_COUNTER_BALANCE_MESSAGES 5477
259 balance BSAL_COUNTER_BALANCE_MESSAGES -7096
259 balance BSAL_COUNTER_BALANCE_MESSAGES 9529
259 balance BSAL_COUNTER_BALANCE_MESSAGES 4080
259 balance BSAL_COUNTER_BALANCE_MESSAGES 6408
259 balance BSAL_COUNTER_BALANCE_MESSAGES 6859
[boisvert@cetuslac1 biosal-tests]$ grep BSAL_COUNTER_BALANCE_MESSAGES biosal-Iowa-cetus-1.output | grep ^"261 " | tail
261 balance BSAL_COUNTER_BALANCE_MESSAGES 3349
261 balance BSAL_COUNTER_BALANCE_MESSAGES -2909
261 balance BSAL_COUNTER_BALANCE_MESSAGES 863
261 balance BSAL_COUNTER_BALANCE_MESSAGES -6086
261 balance BSAL_COUNTER_BALANCE_MESSAGES 3030
261 balance BSAL_COUNTER_BALANCE_MESSAGES -1218
261 balance BSAL_COUNTER_BALANCE_MESSAGES -806
261 balance BSAL_COUNTER_BALANCE_MESSAGES -7859
261 balance BSAL_COUNTER_BALANCE_MESSAGES -5600
261 balance BSAL_COUNTER_BALANCE_MESSAGES 1713
[boisvert@cetuslac1 biosal-tests]$ grep BSAL_COUNTER_BALANCE_MESSAGES biosal-Iowa-cetus-1.output | grep ^"260 " | tail
260 balance BSAL_COUNTER_BALANCE_MESSAGES 413592
260 balance BSAL_COUNTER_BALANCE_MESSAGES 432511
260 balance BSAL_COUNTER_BALANCE_MESSAGES 453165
260 balance BSAL_COUNTER_BALANCE_MESSAGES 503060
260 balance BSAL_COUNTER_BALANCE_MESSAGES 516621
260 balance BSAL_COUNTER_BALANCE_MESSAGES 541214
260 balance BSAL_COUNTER_BALANCE_MESSAGES 569961
260 balance BSAL_COUNTER_BALANCE_MESSAGES 614067
260 balance BSAL_COUNTER_BALANCE_MESSAGES 658516
260 balance BSAL_COUNTER_BALANCE_MESSAGES 691073
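Instead of grepping one node at a time, the same counter lines can be ranked across all nodes. A rough sketch, assuming every counter line has the form `<node> balance BSAL_COUNTER_BALANCE_MESSAGES <value>` as in the output above; the function name `last_balance_per_node` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "<node> <last reported balance>" for every node,
# largest balance first. Reads the named files, or stdin if none are given.
last_balance_per_node() {
    awk '
        $2 == "balance" && $3 == "BSAL_COUNTER_BALANCE_MESSAGES" {
            last[$1] = $4                    # keep only the most recent value
        }
        END { for (node in last) print node, last[node] }' "$@" |
        sort -k2,2nr
}
```

Possible usage: `last_balance_per_node biosal-Iowa-cetus-1.output | head`. A node whose balance keeps growing, like 260 here, should show up at the top.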
on cetus
[boisvert@cetuslac1 biosal-tests]$ ./biosal-Iowa-cetus-2.sh 301277
[boisvert@cetuslac1 biosal-tests]$ sha1sum coverage_distribution.txt-canonical
01a293db48518190038eaddbaed8a47ca0323fc7 coverage_distribution.txt-canonical
[boisvert@cetuslac1 biosal-tests]$ cbank list jobs -p CompBIO|grep 301372.cetus
running time:
0:52:40
[boisvert@cetuslac1 biosal-tests]$ cat biosal-Iowa-cetus-4.sh
#!/bin/bash
qsub \
-A CompBIO \
-n 512 \
-t 01:00:00 \
-O biosal-Iowa-cetus-4 \
--mode c1 \
argonnite -print-counters -print-load -print-memory-usage -threads-per-node 16 -k 43 Iowa_Continuous_Corn/*.fastq
[boisvert@cetuslac1 biosal-tests]$ grep TIMER biosal-Iowa-cetus-4.output
TIMER
/gpfs/mira-fs1/projects/CompBIO/Projects/biosal-tests
biosal-Iowa-23 is queued (300250)