Closed sebhtml closed 10 years ago
with CORE_MEMORY_DEBUG
Iteration 1
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 129995
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 98142 0xc170626e 15292 0x9739a8fa 3824 0x146f7d15 3317 0x2d9d481 3225 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 803 0x46d316e4 280 0x1d4f2792 224 0x78e238cd
Iteration 2
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 119510 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 101438 0xc170626e 4553 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 766 0x46d316e4 532 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6
core/system/memory_pool.c:#define MEMORY_MEMORY_POOL 0xc170626e
Iteration 3
[boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 36438 8 15380 16 13205 32 10272 64 8621 24 3910 1024 3563 4608 3011 128 2120 48 1979 256
[boisvert@bigmem biosal]$ grep worker_run log|wc -l
36052
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 127318 0x28fc4a42 127316 0x771872d8 39476 0xc170626e 4558 0x146f7d15 3314 0x2d9d481 3222 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 764 0x46d316e4 532 0x1d4f2792
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 87830
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 69743 0xc170626e 4559 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 764 0x46d316e4 532 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6
by size:
[boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 30109 16368 16283 8 8607 24 3688 1024 3564 3584 2009 48 1051 4 625 16 561 32 447 512
[boisvert@miralac1 automated-tests]$ addr2line -e spate-2014-11-15-04-07-25.spate < spate-2014-11-15-04-07-25.stack /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/tracer.c:42 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory.c:135 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:161 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/structures/simple_queue.c:174 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:546 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/genomics/data/dna_kmer.c:47 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/genomics/assembly/assembly_arc.c:27 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/genomics/assembly/assembly_arc_block.c:224 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/genomics/assembly/assembly_arc_classifier.c:132 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/actor.c:257 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/node.c:1754 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1927 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1685 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:713 /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322 :0
69743 0xc170626e 30109 16368
This one can be removed by using the segments directly to store a linked list.
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 60478 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 42737 0xc170626e 4542 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 758 0x46d316e4 523 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6 [boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 17117 8 9176 32 3992 64 3904 2048 3724 1024 1884 4 633 16 455 128 449 512 367 32768
irb(main):002:0> 100.0 * (60478 - 87830) / 87830 => -31.14197882272572
[boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 15669 8 9169 32 3901 2048 3724 1024 1114 64 627 16 469 128 447 512 436 4 369 32768
[boisvert@bigmem biosal]$ grep -v 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 4574 16 3248 131072 3223 24576 2800 2480 947 1048576 521 2097152 336 24 291 4194304 283 16777216 224 2768
[boisvert@bigmem biosal]$ grep ^# log|awk '{print $2}'|sort|uniq -c|sort -r -n|head -n 20 32169 ./applications/spate_metagenome_assembler/spate() 16737 ./applications/spate_metagenome_assembler/spate(core_hash_table_add+0xcd) 16402 ./applications/spate_metagenome_assembler/spate(core_tracer_print_stack_backtrace+0x24) 16397 ./applications/spate_metagenome_assembler/spate(core_memory_pool_allocate+0x261) 16397 ./applications/spate_metagenome_assembler/spate(core_memory_allocate_private+0xbe) 16156 /lib64/libpthread.so.0() 16156 /lib64/libc.so.6(clone+0x6d) 16155 ./applications/spate_metagenome_assembler/spate(thorium_worker_main+0x23) 14601 ./applications/spate_metagenome_assembler/spate(thorium_worker_run+0x22b) 14601 ./applications/spate_metagenome_assembler/spate(thorium_actor_work+0xc8) SUSPICIOUS 7609 ./applications/spate_metagenome_assembler/spate(core_hash_table_group_init+0x48) SUSPICIOUS 7608 ./applications/spate_metagenome_assembler/spate(core_hash_table_group_init+0xa3) 7254 ./applications/spate_metagenome_assembler/spate(core_map_add_value+0x4b) 7190 ./applications/spate_metagenome_assembler/spate(core_memory_pool_allocate+0x20c) 7125 ./applications/spate_metagenome_assembler/spate(core_memory_pool_free_private+0xfd) 7125 ./applications/spate_metagenome_assembler/spate(core_memory_pool_free+0x12d) SUSPICIOUS 6046 ./applications/spate_metagenome_assembler/spate(biosal_dna_kmer_destroy+0x14) 5600 ./applications/spate_metagenome_assembler/spate(biosal_vertex_neighborhood_receive+0xe6) SUSPICIOUS 5599 ./applications/spate_metagenome_assembler/spate(core_vector_pack_unpack+0xcc) SUSPÎCIOUS** 5598 ./applications/spate_metagenome_assembler/spate(biosal_unitig_visitor_receive+0x23b)
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 39965 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 22062 0xc170626e 4622 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2799 0x185945f7 1008 0x89e9235d 766 0x46d316e4 588 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6 [boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 9192 8 4340 32 3628 2048 2294 64 1152 4 389 512 368 128 278 32768 193 16 72 256
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 35735
irb(main):001:0> (35735.0-39965)/39965*100 => -10.584261228575004
Reduction since start: irb(main):009:0> 100.0*(35735-129995)/129995 => -72.51048117235278
-72%
latency_probe
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 8610 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 3408 0xc170626e 2508 0x2d9d481 2428 0x37ddf367 119 0x146f7d15 39 0x2c6a06d3 24 0x97b478ba 24 0x46d316e4 24 0x323f94b8 24 0x2ee1c5a6 4 0xf3ad5880 [boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 2406 128 290 8 258 32 109 32768 109 2048 78 16 53 256 33 512 33 1024 14 4
spate
[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 35758 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 17790 0xc170626e 4655 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 766 0x46d316e4 588 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6 [boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 8164 8 4376 32 3664 2048 389 512 369 128 277 32768 185 16 82 64 72 256 70 1024
Between vertices processing and arc processing:
[boisvert@bigmem biosal]$ cat log9|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 857 24 817 8 578 1048576 414 32 291 4194304 286 512 283 16777216 136 2097152 117 2048 110 32768
JobName Goal Trace memory allocations, one stdout file per node.
Machine AllocationStatus Path /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests
Commit added CORE_DEBUG_MEMORY [boisvert@cetuslac1 biosal]$ ./scripts/IBM_Blue_Gene_Q/build-spate-xl.sh
Toolchain Script added -freopen-stdout
[boisvert@cetuslac1 biosal]$ cat /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/spate-2014-11-19-02-24-12.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-19-02-24-12 \ --mode c1 \ spate-2014-11-19-02-24-12.spate -print-load -threads-per-node 16 \ -k 27 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-19-02-24-12 -freopen-stdout
Submission [boisvert@cetuslac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-19-02-24-12 (c2ac9bd52207e565955fe93185f7bce21725c444)
[boisvert@cetuslac1 biosal]$ cp applications/spate_metagenome_assembler/spate /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/spate-2014-11-19-02-24-12.spate
[boisvert@cetuslac1 automated-tests]$ ./spate-2014-11-19-02-24-12.sh 366861
MachineUtilization ComputationLoad RunningTime ran for like 30 minutes, then:
[boisvert@cetuslac1 automated-tests]$ tail spate-2014-11-19-02-24-12.cobaltlog Wed Nov 19 02:30:06 2014 +0000 (UTC) submitted with cwd set to: /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests Wed Nov 19 02:39:20 2014 +0000 (UTC) boisvert/366861: Initiating boot at location CET-40000-73371-1024. Wed Nov 19 02:45:49 2014 +0000 (UTC) boisvert/366861: Block CET-40000-73371-1024 free, retrying boot. Wed Nov 19 02:45:50 2014 +0000 (UTC) boisvert/366861: Initiating boot at location CET-40000-73371-1024. Wed Nov 19 02:52:22 2014 +0000 (UTC) boisvert/366861: Block CET-40000-73371-1024 free, retrying boot. Wed Nov 19 02:52:23 2014 +0000 (UTC) boisvert/366861: Initiating boot at location CET-40000-73371-1024. Wed Nov 19 02:58:59 2014 +0000 (UTC) boisvert/366861: Block CET-40000-73371-1024 free, retrying boot. Wed Nov 19 02:59:00 2014 +0000 (UTC) boisvert/366861: Initiating boot at location CET-40000-73371-1024. Wed Nov 19 03:05:32 2014 +0000 (UTC) boisvert/366861: Boot terminated on CET-40000-73371-1024: too many reboot attempts. Wed Nov 19 03:05:39 2014 +0000 (UTC) Info: task completed normally with an exit code of 1234567; initiating job cleanup and removal
[boisvert@cetuslac1 automated-tests]$ ls|grep spate-2014-11-19-02-24-12 spate-2014-11-19-02-24-12.cobaltlog spate-2014-11-19-02-24-12.job spate-2014-11-19-02-24-12.sh spate-2014-11-19-02-24-12.spate
MemoryUtilization Checksum GoodComments BadComments NeutralComments
Note 1 After 17 minutes, there is still nothing written (no output directory and no stdout log files)
366861 spate-2014-11-19-02-24-12 boisvert 0.7 01:00:00 00:08:44 00:17:10 1024 running CET-40000-73371-1024 c1 1024 default Wed Nov 19 02:38:50 2014 +0000 (UTC) [boisvert@cetuslac1 automated-tests]$ ls|grep spate-2014-11-19-02-24-12 spate-2014-11-19-02-24-12.cobaltlog spate-2014-11-19-02-24-12.job spate-2014-11-19-02-24-12.sh spate-2014-11-19-02-24-12.spate
JobName Goal understand memory fragmentation with tracepoints, also I hope that this one does not hit the reboot issue.
Machine AllocationStatus Path Commit Toolchain Script [boisvert@cetuslac1 automated-tests]$ cat spate-2014-11-19-02-24-12-2.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-19-02-24-12-2 \ --mode c1 \ spate-2014-11-19-02-24-12-2.spate -print-load -threads-per-node 16 \ -k 27 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-19-02-24-12-2 -freopen-stdout
Submission [boisvert@cetuslac1 automated-tests]$ ./spate-2014-11-19-02-24-12-2.sh 366890
MachineUtilization ComputationLoad RunningTime
[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-19-02-24-12-2.* spate-2014-11-19-02-24-12-2.01023.txt:TIMER [Load input / Count input data] 29.850824 seconds Binary file spate-2014-11-19-02-24-12-2.spate matches
All but one log file is empty:
[boisvert@cetuslac1 automated-tests]$ ls -lh|grep spate-2014-11-19-02-24-12-2|head
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00000.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00001.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00002.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00003.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00004.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00005.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00006.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00007.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00008.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00009.txt
[boisvert@cetuslac1 automated-tests]$ ls -lh|grep spate-2014-11-19-02-24-12-2|tail
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.01019.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.01020.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.01021.txt
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.01022.txt
-rw-r--r-- 1 boisvert CompBIO 1.4M Nov 19 04:10 spate-2014-11-19-02-24-12-2.01023.txt
-rw-r--r-- 1 boisvert CompBIO 7.3K Nov 19 05:08 spate-2014-11-19-02-24-12-2.cobaltlog
-rw-r--r-- 1 boisvert CompBIO 925 Nov 19 04:09 spate-2014-11-19-02-24-12-2.error
-rw-r--r-- 1 boisvert CompBIO 3.4M Nov 19 04:09 spate-2014-11-19-02-24-12-2.output
-rwxr-xr-x 1 boisvert CompBIO 365 Nov 19 03:55 spate-2014-11-19-02-24-12-2.sh
-rwxr-xr-x 1 boisvert CompBIO 58M Nov 19 03:55 spate-2014-11-19-02-24-12-2.spate
MemoryUtilization Checksum GoodComments waiting for [alcf-support #240169] => ALCF can not help becauase there is not enough information in logs.
BadComments NeutralComments
JobName Goal Try the freopen code path on Mira
Machine Mira
AllocationStatus Path Commit Toolchain Script [boisvert@cetuslac1 automated-tests]$ cat spate-2014-11-19-02-24-12-3.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-19-02-24-12-3 \ --mode c1 \ spate-2014-11-19-02-24-12-3.spate -print-load -threads-per-node 16 \ -k 27 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-19-02-24-12-3 -freopen-stdout
Submission [boisvert@miralac1 automated-tests]$ ./spate-2014-11-19-02-24-12-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 367136
MachineUtilization ComputationLoad RunningTime
[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-19-02-24-12-3.* spate-2014-11-19-02-24-12-3.01023.txt:TIMER [Load input / Count input data] 38.665909 seconds spate-2014-11-19-02-24-12-3.01023.txt:TIMER [Load input / Distribute input data] 1 minutes, 29.921494 seconds spate-2014-11-19-02-24-12-3.01023.txt:TIMER [Load input] 2 minutes, 8.587402 seconds spate-2014-11-19-02-24-12-3.01021.txt:TIMER [Build assembly graph / Distribute vertices] 8 minutes, 48.348511 seconds
spate-2014-11-19-02-24-12-4 took 32 minutes for the vertices.
MemoryUtilization [boisvert@cetuslac1 automated-tests]$ grep "TRACEPOINT memory:allocate" spate-2014-11-19-02-24-12-3.01020.txt|awk '{print $4}'|sort -n|uniq -c|sort -r -n|head 7365 24 <================ 5615 1048576 2302 2097152 1172 32768 395 8 309 32 294 8388608 252 420 128 131072 117 2048
Here are the API keys that are used to allocate 24 bytes:
[boisvert@cetuslac1 automated-tests]$ grep "memory:allocate" spate-2014-11-19-02-24-12-3.01020.txt|grep "size= 24"|awk '{print $14}'|sort|uniq -c|sort -r -n|head 4103 0x1d4f2792 1497 0xa156d064 740 0x46d316e4 540 0xee1344f0 270 0x89e9235d 95 0x2d9d481 94 0x9739a8fa 75 0x2ee1c5a6 15 0x97b478ba 15 0x84a83916
[boisvert@bigmem biosal]$ grep 0x1d4f2792 * -R genomics/assembly/assembly_arc_classifier.c:#define MEMORY_NAME_ARC_CLASSIFIER 0x1d4f2792
[boisvert@bigmem biosal]$ grep 0xa156d064 * -R genomics/assembly/assembly_block_classifier.c:#define MEMORY_POOL_NAME_VERTEX_CLASSIFIER 0xa156d064
Checksum GoodComments spate-2014-11-19-02-24-12-3.01020.txt contains tracepoint events.
BadComments NeutralComments
Traces:
spate-2014-11-19-02-24-12-3.01020.txt-0xa156d064
JobName Goal Try without freopen on Cetus
Machine AllocationStatus Path Commit Toolchain Script [boisvert@cetuslac1 automated-tests]$ cat spate-2014-11-19-02-24-12-4.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-19-02-24-12-4 \ --mode c1 \ spate-2014-11-19-02-24-12-4.spate -print-load -threads-per-node 16 \ -k 27 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-19-02-24-12-4
Submission [boisvert@cetuslac1 automated-tests]$ ./spate-2014-11-19-02-24-12-4.sh 367151
MachineUtilization ComputationLoad RunningTime [boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-19-02-24-12-4.output TIMER [Load input / Count input data] 42.953930 seconds TIMER [Load input / Distribute input data] 1 minutes, 50.433594 seconds TIMER [Load input] 2 minutes, 33.387527 seconds TIMER [Build assembly graph / Distribute vertices] 32 minutes, 27.249634 seconds
MemoryUtilization Checksum GoodComments BadComments NeutralComments Not required because spate-2014-11-19-02-24-12-3 worked out just fine
both jobs are still queued.
Use memory cache instead of memory pool for:
[boisvert@bigmem biosal]$ grep memory_pool:allocate spate-007.00002.txt | wc -l 103491 [boisvert@bigmem biosal]$ grep biosal_dna_kmer_init_copy spate-007.00002.txt | wc -l 101378
Trace:
[boisvert@bigmem biosal]$ grep memory_pool:allocate spate-007.00003.txt | wc -l 9224967 [boisvert@bigmem biosal]$ grep memory_pool:free spate-007.00003.txt | wc -l 9224967
tickets: #824 #825
JobName Goal
Machine AllocationStatus Path Commit c6a5404833ad5417835b49c4eb29de356e269f99
Toolchain Script Submission [boisvert@cetuslac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-22-00-25-06 (c6a5404833ad5417835b49c4eb29de356e269f99) 368361
MachineUtilization ComputationLoad RunningTime [boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-22-00-25-06.output TIMER [Load input / Count input data] 38.569576 seconds TIMER [Load input / Distribute input data] 1 minutes, 41.074715 seconds TIMER [Load input] 2 minutes, 19.644287 seconds
MemoryUtilization Checksum GoodComments BadComments Error, node/769 received signal SIGSEGV
no stack...
2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: terminated by signal 11 2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 769 2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: 5 INFO RAS events 2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: most recent RAS event text: DDR Correctable Error Summary : count=10000 MCFIR error status: [MEMORY_CE] This bit is set when a memory CE is detected on a non-maintenance memory read op; 2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: with severity INFO 2014-11-22 01:52:42.440 (INFO ) [0x4000122bde0] tatu.runjob.client: task terminated by signal 11
NeutralComments
JobName Goal Machine AllocationStatus Path Commit b1d6993a89dfb11cfb7856cfb5a7419ab0b7a5d3
Toolchain Script Submission [boisvert@cetuslac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-24-02-09-03 (b1d6993a89dfb11cfb7856cfb5a7419ab0b7a5d3) 369199
MachineUtilization ComputationLoad RunningTime [boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-24-02-09-03.output TIMER [Load input / Count input data] 41.526814 seconds TIMER [Load input / Distribute input data] 1 minutes, 42.540985 seconds TIMER [Load input] 2 minutes, 24.067795 seconds TIMER [Build assembly graph / Distribute vertices] 9 minutes, 41.330444 secondscore_manager/1055741 dies TIMER [Build assembly graph / Distribute arcs] 21 minutes, 54.272461 seconds TIMER [Build assembly graph] 31 minutes, 35.602905 seconds
Mon Nov 24 04:58:31 2014 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination
MemoryUtilization
Checksum GoodComments [boisvert@cetuslac1 automated-tests]$ grep GRAPH spate-2014-11-24-02-09-03.output GRAPH -> 148043434448 vertices, 324996128800 vertex observations, and 130889809559 arcs.
BadComments NeutralComments
Result:
the mallopt thing provided by Hal fixed somehow the issue.
There is still some memory issue, but it is not running out of memory.
A
http://www.idris.fr/eng/turing/turing-fragmentation_memoire-eng.html