GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License
6 stars 1 forks source link

memory: reduce memory fragmentation on Blue Gene/Q #819

Closed sebhtml closed 10 years ago

sebhtml commented 10 years ago

http://www.idris.fr/eng/turing/turing-fragmentation_memoire-eng.html

sebhtml commented 10 years ago

with CORE_MEMORY_DEBUG

sebhtml commented 10 years ago

Iteration 1

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 129995

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 98142 0xc170626e 15292 0x9739a8fa 3824 0x146f7d15 3317 0x2d9d481 3225 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 803 0x46d316e4 280 0x1d4f2792 224 0x78e238cd

sebhtml commented 10 years ago

Iteration 2

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 119510 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 101438 0xc170626e 4553 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 766 0x46d316e4 532 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6

core/system/memory_pool.c:#define MEMORY_MEMORY_POOL 0xc170626e

sebhtml commented 10 years ago

Iteration 3

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 36438 8 15380 16 13205 32 10272 64 8621 24 3910 1024 3563 4608 3011 128 2120 48 1979 256

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ grep worker_run log|wc -l
36052

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 127318 0x28fc4a42 127316 0x771872d8 39476 0xc170626e 4558 0x146f7d15 3314 0x2d9d481 3222 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 764 0x46d316e4 532 0x1d4f2792

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 87830

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 69743 0xc170626e 4559 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 764 0x46d316e4 532 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6

by size:

[boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 30109 16368 16283 8 8607 24 3688 1024 3564 3584 2009 48 1051 4 625 16 561 32 447 512

sebhtml commented 10 years ago

[boisvert@miralac1 automated-tests]$ addr2line -e spate-2014-11-15-04-07-25.spate < spate-2014-11-15-04-07-25.stack /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/tracer.c:42 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory.c:135 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:161 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/structures/simple_queue.c:174 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:546 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/genomics/data/dna_kmer.c:47 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/genomics/assembly/assembly_arc.c:27 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/genomics/assembly/assembly_arc_block.c:224 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/genomics/assembly/assembly_arc_classifier.c:132 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/actor.c:257 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/node.c:1754 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1927 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1685 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:713 /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322 :0

sebhtml commented 10 years ago

69743 0xc170626e 30109 16368

This one can be removed by using the segments directly to store a linked list.

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 60478 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 42737 0xc170626e 4542 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 758 0x46d316e4 523 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6 [boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 17117 8 9176 32 3992 64 3904 2048 3724 1024 1884 4 633 16 455 128 449 512 367 32768

sebhtml commented 10 years ago

irb(main):002:0> 100.0 * (60478 - 87830) / 87830 => -31.14197882272572

sebhtml commented 10 years ago

Distribution by size, all keys

[boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 15669 8 9169 32 3901 2048 3724 1024 1114 64 627 16 469 128 447 512 436 4 369 32768

[boisvert@bigmem biosal]$ grep -v 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 4574 16 3248 131072 3223 24576 2800 2480 947 1048576 521 2097152 336 24 291 4194304 283 16777216 224 2768

Key 0xc170626e and size 8

[boisvert@bigmem biosal]$ grep ^# log|awk '{print $2}'|sort|uniq -c|sort -r -n|head -n 20 32169 ./applications/spate_metagenome_assembler/spate() 16737 ./applications/spate_metagenome_assembler/spate(core_hash_table_add+0xcd) 16402 ./applications/spate_metagenome_assembler/spate(core_tracer_print_stack_backtrace+0x24) 16397 ./applications/spate_metagenome_assembler/spate(core_memory_pool_allocate+0x261) 16397 ./applications/spate_metagenome_assembler/spate(core_memory_allocate_private+0xbe) 16156 /lib64/libpthread.so.0() 16156 /lib64/libc.so.6(clone+0x6d) 16155 ./applications/spate_metagenome_assembler/spate(thorium_worker_main+0x23) 14601 ./applications/spate_metagenome_assembler/spate(thorium_worker_run+0x22b) 14601 ./applications/spate_metagenome_assembler/spate(thorium_actor_work+0xc8) SUSPICIOUS 7609 ./applications/spate_metagenome_assembler/spate(core_hash_table_group_init+0x48) SUSPICIOUS 7608 ./applications/spate_metagenome_assembler/spate(core_hash_table_group_init+0xa3) 7254 ./applications/spate_metagenome_assembler/spate(core_map_add_value+0x4b) 7190 ./applications/spate_metagenome_assembler/spate(core_memory_pool_allocate+0x20c) 7125 ./applications/spate_metagenome_assembler/spate(core_memory_pool_free_private+0xfd) 7125 ./applications/spate_metagenome_assembler/spate(core_memory_pool_free+0x12d) SUSPICIOUS 6046 ./applications/spate_metagenome_assembler/spate(biosal_dna_kmer_destroy+0x14) 5600 ./applications/spate_metagenome_assembler/spate(biosal_vertex_neighborhood_receive+0xe6) SUSPICIOUS 5599 ./applications/spate_metagenome_assembler/spate(core_vector_pack_unpack+0xcc) SUSPÎCIOUS** 5598 ./applications/spate_metagenome_assembler/spate(biosal_unitig_visitor_receive+0x23b)

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 39965 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 22062 0xc170626e 4622 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2799 0x185945f7 1008 0x89e9235d 766 0x46d316e4 588 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6 [boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 9192 8 4340 32 3628 2048 2294 64 1152 4 389 512 368 128 278 32768 193 16 72 256

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 35735

sebhtml commented 10 years ago

irb(main):001:0> (35735.0-39965)/39965*100 => -10.584261228575004

sebhtml commented 10 years ago

Reduction since start: irb(main):009:0> 100.0*(35735-129995)/129995 => -72.51048117235278

-72%

sebhtml commented 10 years ago

latency_probe

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 8610 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 3408 0xc170626e 2508 0x2d9d481 2428 0x37ddf367 119 0x146f7d15 39 0x2c6a06d3 24 0x97b478ba 24 0x46d316e4 24 0x323f94b8 24 0x2ee1c5a6 4 0xf3ad5880 [boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 2406 128 290 8 258 32 109 32768 109 2048 78 16 53 256 33 512 33 1024 14 4

sebhtml commented 10 years ago

spate

[boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|wc -l 35758 [boisvert@bigmem biosal]$ grep memory_allocate log|awk '{print $10}'|sort|uniq -c|sort -r -n|head 17790 0xc170626e 4655 0x146f7d15 3315 0x2d9d481 3223 0x37ddf367 2800 0x185945f7 1008 0x89e9235d 766 0x46d316e4 588 0x1d4f2792 419 0x9739a8fa 395 0x2ee1c5a6 [boisvert@bigmem biosal]$ grep 0xc170626e log|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 8164 8 4376 32 3664 2048 389 512 369 128 277 32768 185 16 82 64 72 256 70 1024

sebhtml commented 10 years ago

Between vertices processing and arc processing:

[boisvert@bigmem biosal]$ cat log9|grep allocate | awk '{print $3}'|sort -n|uniq -c|sort -r -n|head 857 24 817 8 578 1048576 414 32 291 4194304 286 512 283 16777216 136 2097152 117 2048 110 32768

sebhtml commented 10 years ago

JobName Goal Trace memory allocations, one stdout file per node.

Machine AllocationStatus Path /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests

Commit added CORE_DEBUG_MEMORY [boisvert@cetuslac1 biosal]$ ./scripts/IBM_Blue_Gene_Q/build-spate-xl.sh

Toolchain Script added -freopen-stdout

[boisvert@cetuslac1 biosal]$ cat /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/spate-2014-11-19-02-24-12.sh

!/bin/bash

echo "Commit= c2ac9bd52207e565955fe93185f7bce21725c444"

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-19-02-24-12 \ --mode c1 \ spate-2014-11-19-02-24-12.spate -print-load -threads-per-node 16 \ -k 27 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-19-02-24-12 -freopen-stdout

Submission [boisvert@cetuslac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-19-02-24-12 (c2ac9bd52207e565955fe93185f7bce21725c444)

[boisvert@cetuslac1 biosal]$ cp applications/spate_metagenome_assembler/spate /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/spate-2014-11-19-02-24-12.spate

[boisvert@cetuslac1 automated-tests]$ ./spate-2014-11-19-02-24-12.sh 366861

MachineUtilization ComputationLoad RunningTime ran for like 30 minutes, then:

[boisvert@cetuslac1 automated-tests]$ tail spate-2014-11-19-02-24-12.cobaltlog Wed Nov 19 02:30:06 2014 +0000 (UTC) submitted with cwd set to: /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests Wed Nov 19 02:39:20 2014 +0000 (UTC) boisvert/366861: Initiating boot at location CET-40000-73371-1024. Wed Nov 19 02:45:49 2014 +0000 (UTC) boisvert/366861: Block CET-40000-73371-1024 free, retrying boot. Wed Nov 19 02:45:50 2014 +0000 (UTC) boisvert/366861: Initiating boot at location CET-40000-73371-1024. Wed Nov 19 02:52:22 2014 +0000 (UTC) boisvert/366861: Block CET-40000-73371-1024 free, retrying boot. Wed Nov 19 02:52:23 2014 +0000 (UTC) boisvert/366861: Initiating boot at location CET-40000-73371-1024. Wed Nov 19 02:58:59 2014 +0000 (UTC) boisvert/366861: Block CET-40000-73371-1024 free, retrying boot. Wed Nov 19 02:59:00 2014 +0000 (UTC) boisvert/366861: Initiating boot at location CET-40000-73371-1024. Wed Nov 19 03:05:32 2014 +0000 (UTC) boisvert/366861: Boot terminated on CET-40000-73371-1024: too many reboot attempts. Wed Nov 19 03:05:39 2014 +0000 (UTC) Info: task completed normally with an exit code of 1234567; initiating job cleanup and removal

[boisvert@cetuslac1 automated-tests]$ ls|grep spate-2014-11-19-02-24-12 spate-2014-11-19-02-24-12.cobaltlog spate-2014-11-19-02-24-12.job spate-2014-11-19-02-24-12.sh spate-2014-11-19-02-24-12.spate

MemoryUtilization Checksum GoodComments BadComments NeutralComments

Note 1 After 17 minutes, there is still nothing written (no output directory and no stdout log files)

[boisvert@cetuslac1 automated-tests]$ qstat -f -u boisvert JobID JobName User Score WallTime QueuedTime RunTime Nodes State Location Mode Procs Queue StartTime

366861 spate-2014-11-19-02-24-12 boisvert 0.7 01:00:00 00:08:44 00:17:10 1024 running CET-40000-73371-1024 c1 1024 default Wed Nov 19 02:38:50 2014 +0000 (UTC) [boisvert@cetuslac1 automated-tests]$ ls|grep spate-2014-11-19-02-24-12 spate-2014-11-19-02-24-12.cobaltlog spate-2014-11-19-02-24-12.job spate-2014-11-19-02-24-12.sh spate-2014-11-19-02-24-12.spate

sebhtml commented 10 years ago

JobName Goal understand memory fragmentation with tracepoints, also I hope that this one does not hit the reboot issue.

Machine AllocationStatus Path Commit Toolchain Script [boisvert@cetuslac1 automated-tests]$ cat spate-2014-11-19-02-24-12-2.sh

!/bin/bash

echo "Commit= c2ac9bd52207e565955fe93185f7bce21725c444"

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-19-02-24-12-2 \ --mode c1 \ spate-2014-11-19-02-24-12-2.spate -print-load -threads-per-node 16 \ -k 27 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-19-02-24-12-2 -freopen-stdout

Submission [boisvert@cetuslac1 automated-tests]$ ./spate-2014-11-19-02-24-12-2.sh 366890

MachineUtilization ComputationLoad RunningTime

[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-19-02-24-12-2.* spate-2014-11-19-02-24-12-2.01023.txt:TIMER [Load input / Count input data] 29.850824 seconds Binary file spate-2014-11-19-02-24-12-2.spate matches

All but one log file is empty:

[boisvert@cetuslac1 automated-tests]$ ls -lh|grep spate-2014-11-19-02-24-12-2|head
-rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00000.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00001.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00002.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00003.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00004.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00005.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00006.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00007.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00008.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.00009.txt [boisvert@cetuslac1 automated-tests]$ ls -lh|grep spate-2014-11-19-02-24-12-2|tail -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.01019.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.01020.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.01021.txt -rw-r--r-- 1 boisvert CompBIO 0 Nov 19 04:09 spate-2014-11-19-02-24-12-2.01022.txt -rw-r--r-- 1 boisvert CompBIO 1.4M Nov 19 04:10 spate-2014-11-19-02-24-12-2.01023.txt -rw-r--r-- 1 boisvert CompBIO 7.3K Nov 19 05:08 spate-2014-11-19-02-24-12-2.cobaltlog -rw-r--r-- 1 boisvert CompBIO 925 Nov 19 04:09 spate-2014-11-19-02-24-12-2.error -rw-r--r-- 1 boisvert CompBIO 3.4M Nov 19 04:09 spate-2014-11-19-02-24-12-2.output -rwxr-xr-x 1 boisvert CompBIO 365 Nov 19 03:55 spate-2014-11-19-02-24-12-2.sh -rwxr-xr-x 1 boisvert CompBIO 58M Nov 19 03:55 spate-2014-11-19-02-24-12-2.spate

MemoryUtilization Checksum GoodComments waiting for [alcf-support #240169] => ALCF can not help becauase there is not enough information in logs.

BadComments NeutralComments

sebhtml commented 10 years ago

JobName Goal Try the freopen code path on Mira

Machine Mira

AllocationStatus Path Commit Toolchain Script [boisvert@cetuslac1 automated-tests]$ cat spate-2014-11-19-02-24-12-3.sh

!/bin/bash

echo "Commit= c2ac9bd52207e565955fe93185f7bce21725c444"

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-19-02-24-12-3 \ --mode c1 \ spate-2014-11-19-02-24-12-3.spate -print-load -threads-per-node 16 \ -k 27 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-19-02-24-12-3 -freopen-stdout

Submission [boisvert@miralac1 automated-tests]$ ./spate-2014-11-19-02-24-12-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 367136

MachineUtilization ComputationLoad RunningTime

[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-19-02-24-12-3.* spate-2014-11-19-02-24-12-3.01023.txt:TIMER [Load input / Count input data] 38.665909 seconds spate-2014-11-19-02-24-12-3.01023.txt:TIMER [Load input / Distribute input data] 1 minutes, 29.921494 seconds spate-2014-11-19-02-24-12-3.01023.txt:TIMER [Load input] 2 minutes, 8.587402 seconds spate-2014-11-19-02-24-12-3.01021.txt:TIMER [Build assembly graph / Distribute vertices] 8 minutes, 48.348511 seconds

spate-2014-11-19-02-24-12-4 took 32 minutes for the vertices.

MemoryUtilization [boisvert@cetuslac1 automated-tests]$ grep "TRACEPOINT memory:allocate" spate-2014-11-19-02-24-12-3.01020.txt|awk '{print $4}'|sort -n|uniq -c|sort -r -n|head 7365 24 <================ 5615 1048576 2302 2097152 1172 32768 395 8 309 32 294 8388608 252 420 128 131072 117 2048

Here are the API keys that are used to allocate 24 bytes:

[boisvert@cetuslac1 automated-tests]$ grep "memory:allocate" spate-2014-11-19-02-24-12-3.01020.txt|grep "size= 24"|awk '{print $14}'|sort|uniq -c|sort -r -n|head 4103 0x1d4f2792 1497 0xa156d064 740 0x46d316e4 540 0xee1344f0 270 0x89e9235d 95 0x2d9d481 94 0x9739a8fa 75 0x2ee1c5a6 15 0x97b478ba 15 0x84a83916

[boisvert@bigmem biosal]$ grep 0x1d4f2792 * -R genomics/assembly/assembly_arc_classifier.c:#define MEMORY_NAME_ARC_CLASSIFIER 0x1d4f2792

[boisvert@bigmem biosal]$ grep 0xa156d064 * -R genomics/assembly/assembly_block_classifier.c:#define MEMORY_POOL_NAME_VERTEX_CLASSIFIER 0xa156d064

Checksum GoodComments spate-2014-11-19-02-24-12-3.01020.txt contains tracepoint events.

BadComments NeutralComments

Traces:

spate-2014-11-19-02-24-12-3.01020.txt-0xa156d064

sebhtml commented 10 years ago

JobName Goal Try without freopen on Cetus

Machine AllocationStatus Path Commit Toolchain Script [boisvert@cetuslac1 automated-tests]$ cat spate-2014-11-19-02-24-12-4.sh

!/bin/bash

echo "Commit= c2ac9bd52207e565955fe93185f7bce21725c444"

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-19-02-24-12-4 \ --mode c1 \ spate-2014-11-19-02-24-12-4.spate -print-load -threads-per-node 16 \ -k 27 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-19-02-24-12-4

Submission [boisvert@cetuslac1 automated-tests]$ ./spate-2014-11-19-02-24-12-4.sh 367151

MachineUtilization ComputationLoad RunningTime [boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-19-02-24-12-4.output TIMER [Load input / Count input data] 42.953930 seconds TIMER [Load input / Distribute input data] 1 minutes, 50.433594 seconds TIMER [Load input] 2 minutes, 33.387527 seconds TIMER [Build assembly graph / Distribute vertices] 32 minutes, 27.249634 seconds

MemoryUtilization Checksum GoodComments BadComments NeutralComments Not required because spate-2014-11-19-02-24-12-3 worked out just fine

sebhtml commented 10 years ago

both jobs are still queued.

sebhtml commented 10 years ago

Use memory cache instead of memory pool for:

[boisvert@bigmem biosal]$ grep memory_pool:allocate spate-007.00002.txt | wc -l 103491 [boisvert@bigmem biosal]$ grep biosal_dna_kmer_init_copy spate-007.00002.txt | wc -l 101378

Trace:

0 ./applications/spate_metagenome_assembler/spate(core_tracer_print_stack_backtrace+0x24) [0x448634]

1 ./applications/spate_metagenome_assembler/spate(core_memory_pool_allocate+0x187) [0x449037]

2 ./applications/spate_metagenome_assembler/spate(biosal_dna_kmer_init_copy+0x39) [0x441369]

3 ./applications/spate_metagenome_assembler/spate(biosal_assembly_arc_block_add_arc_copy+0x23) [0x437473]

4 ./applications/spate_metagenome_assembler/spate(biosal_assembly_arc_classifier_push_arc_block+0x130) [0x437290]

5 ./applications/spate_metagenome_assembler/spate(thorium_dispatcher_dispatch+0x4e) [0x423b7e]

6 ./applications/spate_metagenome_assembler/spate(biosal_assembly_arc_classifier_receive+0x23) [0x436e73]

7 ./applications/spate_metagenome_assembler/spate() [0x41f800]

8 ./applications/spate_metagenome_assembler/spate(thorium_actor_work+0xc8) [0x41f9e8]

9 ./applications/spate_metagenome_assembler/spate(thorium_worker_run+0x22b) [0x42213b]

10 ./applications/spate_metagenome_assembler/spate(thorium_worker_main+0x23) [0x422563]

sebhtml commented 10 years ago

823

sebhtml commented 10 years ago

[boisvert@bigmem biosal]$ grep memory_pool:allocate spate-007.00003.txt | wc -l 9224967 [boisvert@bigmem biosal]$ grep memory_pool:free spate-007.00003.txt | wc -l 9224967

sebhtml commented 10 years ago

tickets: #824 #825

sebhtml commented 10 years ago

JobName Goal

Machine AllocationStatus Path Commit c6a5404833ad5417835b49c4eb29de356e269f99

Toolchain Script Submission [boisvert@cetuslac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-22-00-25-06 (c6a5404833ad5417835b49c4eb29de356e269f99) 368361

MachineUtilization ComputationLoad RunningTime [boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-22-00-25-06.output TIMER [Load input / Count input data] 38.569576 seconds TIMER [Load input / Distribute input data] 1 minutes, 41.074715 seconds TIMER [Load input] 2 minutes, 19.644287 seconds

MemoryUtilization Checksum GoodComments BadComments Error, node/769 received signal SIGSEGV

no stack...

2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: terminated by signal 11 2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 769 2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: 5 INFO RAS events 2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: most recent RAS event text: DDR Correctable Error Summary : count=10000 MCFIR error status: [MEMORY_CE] This bit is set when a memory CE is detected on a non-maintenance memory read op; 2014-11-22 01:52:42.439 (WARN ) [0x4000122bde0] CET-40000-73371-1024:884918:ibm.runjob.client.Job: with severity INFO 2014-11-22 01:52:42.440 (INFO ) [0x4000122bde0] tatu.runjob.client: task terminated by signal 11

NeutralComments

sebhtml commented 10 years ago

JobName Goal Machine AllocationStatus Path Commit b1d6993a89dfb11cfb7856cfb5a7419ab0b7a5d3

Toolchain Script Submission [boisvert@cetuslac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-24-02-09-03 (b1d6993a89dfb11cfb7856cfb5a7419ab0b7a5d3) 369199

MachineUtilization ComputationLoad RunningTime [boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-24-02-09-03.output TIMER [Load input / Count input data] 41.526814 seconds TIMER [Load input / Distribute input data] 1 minutes, 42.540985 seconds TIMER [Load input] 2 minutes, 24.067795 seconds TIMER [Build assembly graph / Distribute vertices] 9 minutes, 41.330444 secondscore_manager/1055741 dies TIMER [Build assembly graph / Distribute arcs] 21 minutes, 54.272461 seconds TIMER [Build assembly graph] 31 minutes, 35.602905 seconds

Mon Nov 24 04:58:31 2014 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination

MemoryUtilization

Checksum GoodComments [boisvert@cetuslac1 automated-tests]$ grep GRAPH spate-2014-11-24-02-09-03.output GRAPH -> 148043434448 vertices, 324996128800 vertex observations, and 130889809559 arcs.

BadComments NeutralComments

sebhtml commented 10 years ago

Result:

the mallopt thing provided by Hal fixed somehow the issue.

There is still some memory issue, but it is not running out of memory.

sebhtml commented 10 years ago

A