GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License

perform graph traversal on Beagle #788

Closed sebhtml closed 9 years ago

sebhtml commented 9 years ago

Submitted build spate-2014-11-05-17-12-23 (07d6f14edf312e8ca6f371ae4a8e5172c758aac6) 2905036.sdb

Beagle) grep TIMER spate-2014-11-05-17-12-23.stdout
TIMER [Load input / Count input data] 3 minutes, 41.167130 seconds
TIMER [Load input / Distribute input data] 3 minutes, 34.626663 seconds
TIMER [Load input] 7 minutes, 15.793793 seconds
TIMER [Build assembly graph / Distribute vertices] 14 minutes, 10.378845 seconds

Beagle) tail -n1 spate-2014-11-05-17-12-23.e2905036
[NID 00713] 2014-11-05 17:38:33 Apid 4997901: OOM killer terminated this process.

sebhtml commented 9 years ago

Last good one (#751):

Beagle) grep TIMER spate-2014-11-02-12-27-34.stdout (082485e351f78990c3c13a7c3f2a2df84a3d7856)
TIMER [Load input / Count input data] 3 minutes, 37.046280 seconds
TIMER [Load input / Distribute input data] 3 minutes, 21.185928 seconds
TIMER [Load input] 6 minutes, 58.232208 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 48.143402 seconds
TIMER [Build assembly graph / Distribute arcs] 11 minutes, 2.775940 seconds
TIMER [Build assembly graph] 16 minutes, 50.919373 seconds
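The failing and the last good TIMER listings can be compared mechanically. The following is a minimal sketch and not part of biosal: it assumes the TIMER line format shown in this thread, and `parse_timers` is a hypothetical helper name.

```python
import re

# Matches entries such as:
# TIMER [Build assembly graph / Distribute vertices] 14 minutes, 10.378845 seconds
TIMER_RE = re.compile(
    r"TIMER \[(?P<name>[^\]]+)\] (?P<min>\d+) minutes, (?P<sec>[\d.]+) seconds"
)


def parse_timers(text):
    """Return {timer name: duration in seconds} for every TIMER entry in text."""
    return {
        m.group("name"): int(m.group("min")) * 60 + float(m.group("sec"))
        for m in TIMER_RE.finditer(text)
    }


# One entry from the failing build (07d6f14) and one from the last good build (082485e).
failing = parse_timers(
    "TIMER [Build assembly graph / Distribute vertices] 14 minutes, 10.378845 seconds"
)
good = parse_timers(
    "TIMER [Build assembly graph / Distribute vertices] 5 minutes, 48.143402 seconds"
)

step = "Build assembly graph / Distribute vertices"
slowdown = failing[step] / good[step]
print(f"{step}: {failing[step]:.1f} s vs {good[step]:.1f} s ({slowdown:.2f}x)")
```

Because the pattern is applied with `finditer`, it also works on output where several TIMER entries ended up on a single line.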

Beagle) head -n3 spate-2014-11-02-12-27-34.stdout
thorium_transport: type mpi1_pt2pt_transport
thorium_scheduler: type cfs_scheduler
thorium_message_multiplexer: disabled=0 buffer_size_in_bytes=1024 timeout_in_nanoseconds=200000

Changes between 082485e351f78990c3c13a7c3f2a2df84a3d7856 and 07d6f14:

FAILED SCHEDULED 748e42d cray: don't enable CONFIG_DEBUG by default
5a0882a core: examine memory pool for leak
903ee38 latency_probe: use a random start and round-robin selection
8fb4156 latency_probe: display number of send and received messages at the end
9e1e51e latency_probe: use source actor script
031a2dd latency_probe: add source actor script
575f3bb latency_probe: add target actors
fa78911 latency_probe: add target actor script
eb99e13 thorium_worker: simplify random seed
a1380de thorium_node: fix regression added in 47a651c857
91d2aff thorium_worker: use a better random seed
a0f27c7 thorium: fix regression added in 89300f4ee regarding random numbers
89300f4 thorium: use rand_r() inside worker whenever possible
e9b2e9b ring: always use a memory fence before incrementing the tail
9daab5d thorium_worker: push message directly in ring when possible
1ee0739 thorium_worker: put message directly in outbound message ring
f9fb696 thorium: rename function to thorium_worker_send_local_delivery
077e59e scripts: don't search for action specifiers defined with a BASE
389ce63 thorium_worker: add function to send message to other nodes
5817252 thorium_worker: add function to send with multiplexer
e2bf6c7 performance: limit the number of targets to reduce memory usage
47a651c thorium: assign an actor to a worker when it is spawned
5680d06 thorium: add a function to get the worker name of an actor
655a0be thorium: call rand_r() in actor instead of in worker
6fe3f02 Merge branch 'energy' of github.com:sebhtml/biosal into energy
9621e0d tests: use CORE_DEBUGGER_ASSERT_ENABLED for complex assertions
bfc1d16 tests: don't use rand() in tests, use rand_r() instead
3ff5509 performance: avoid the function rand() in receive()
4cbbb4b examples: don't use rand() in actors
de2c30c genomics: don't use rand() inside actors
2b7a5b7 thorium: don't use rand() to avoid system calls (sys_futex)
FAILED SCHEDULED 07d6f14 tests: fix Beagle launch script for Spate
9006075 tracing: fix LTTng option
SUSPICIOUS 8e0448d thorium: change action naming to follow Minix-style convention
f7ca988 performance: avoid having too many I/O syscalls
c060534 latency_probe: display number of reply messages
87ecc97 thorium: tests show that Iprobe/Recv is better than Irecv/Test
1a32240 thorium: don't disable the multiplexer on BGQ
2c7df75 tests: add prefix in copy command for the build artifact
f53e34a tests: use good executable for latency_probe test
f36d2f0 core: fix compilation error on Blue Gene/Q
07353e8 core: POWER7 has a relaxed (weak) memory model
SUSPICIOUS 33acaeb core: make core_memory_fence() not 'static inline' for profiling
525603c core: simplify fast_ring interface
6942260 core: move spinlock directly in the critical section function
b0b038e performance: fix option in latency_probe
46fde8e scripts: the argument -I. is not needed anymore
898c053 core: use CONFIG_DEBUG and not THORIUM_DEBUG
5b74c66 tests: add launch script for latency_probe on Blue Gene/Q
75a0d13 documentation: improve transport interface documentation
7b2b66a thorium: don't use a aloop in mpi1_pt2pt_transport_test
dbbb6aa thorium_transport: remove outdated constant
723785e thorium_transport: fix documentation of mock_transport
b4a407d mpi1_pt2pt_nonblocking: don't use a loop in test()
3545aa3 thorium_transport: use PT2PT and not P2P
9b5a030 thorium: don't use a loop in mpi1_pt2pt_nonblocking receive()
dfb3879 thorium: set mpi1_pt2pt_nonblocking as default transport subsystem
131e947 documentation: improve compilation option documentation
fdc1296 documentation: fix compilation option table
d981e81 documentation: fix error in readme
89b2a93 documentation: add more compilation options
5fd62d4 documentation: add compilation option in documentation
0330965 thorium: fix memory leak in the demultiplexing code path
80d7c1a build: don't use MPI on Xeon Phi for tests
5c3a601 performance: also include ACTION_PING_REPLY messages in results
c541b41 intel: add script to build on Intel Xeon Phi 7120A
FAILED SCHEDULED 2d04562 build: add option CONFIG_CLOCK_GETTIME=n
15e03f5 build: fix minimal build
7c2dda5 build: improve the option CONFIG_LTTNG
c190bdf build: change THORIUM_DEBUG to CONFIG_DEBUG
2a0ef48 thorium_transport: add fallback transport that does nothing (mock)
1f71cd3 thorium: add compilation option CONFIG_MPI=n to disable MPI
bedee9b documentation: add documentation for CONFIG_LTTNG
07914ff build: add compilation option CONFIG_LTTNG
e41c37c build: rename CONFIG_FLAGS to CONFIG_CFLAGS
b6c3a1d build: add option to disable support for zlib (CONFIG_ZLIB=n)
9e04b23 thorium: add CONFIG_LDFLAGS for optional LDFLAGS
6685b2f core: add independent makefile for file storage
6d528c9 thorium: add independent makefile for schedulers
9031f41 thorium: use independent make file for transport
ee14fd7 thorium: set default value for CONFIG_PAMI=n
7ee9129 performance: fix 4x7 script for latency_probe tests
907d236 thorium: remove empty lines in multiplexer
FAILED SCHEDULED d5d3cec performance: don't show worker count in latency_probe
f9c0ce4 thorium_node: show debug mode in output
3f8852e tests: add more performance tests with latency_probe
SUSPICIOUS 5b3504b core: fix a bug in incremental resizing of hash table
766998c performance: add 2 script to measure throughput on multi-core
ad37b83 engine: use rand_r (scalable) instead of rand (protected)
d9f752a tests: add some tests in the test_map_delete suite
3d18728 tests: add a test for deleting from a map
b948cf9 core: add assertions in memory pool to track a bug
FAILED SCHEDULED 72336f9 thorium: print type of message in message_print
c6730a1 core: verify if pointer is managed by pool if tracking is enabled

BLOCK_START
FAILED SCHEDULED SUSPICIOUS 153eea9 thorium_node: use message type to select pool to free buffer
PASSED SCHEDULED c55aa06 transport: disable nonblocking communication owing to a bug
DIFFERENT_PROBLEM SCHEDULED 3cc35e1 transport: a MPI message can have 0 bytes (this is allowed)
e463f70 transport: use nonblocking transport by default
PASSED SCHEDULED 1cfbfc2 tests: generate unique test artifact for spate automated tests
BLOCK_END

NOT_LIKELY 48be536 test: use unique executable artefact for latency_probe test
NOT_LIKELY 02f617d test: fix the path to latency_probe
NOT_LIKELY 5469856 tests: first generation for latency_probe on Beagle
PASSED SCHEDULED 082485e thorium: make sure that transport requests don't accumulate

sebhtml commented 9 years ago

with new changes:

Submitted build spate-2014-11-06-13-37-43 (89300f4ee6d4cddcaad6c633b980620df32d4981) 2906385.sdb

result: deleted.

sebhtml commented 9 years ago

latency_probe with new changes:

Submitted build latency_probe-2014-11-06-13-50-55 (a0f27c733a825cc336c99ce8e3c970f140155267) 2906408.sdb

Beagle) grep COUNTER latency_probe-2014-11-06-13-50-55.stdout
PERFORMANCE_COUNTER type = ping-pong
PERFORMANCE_COUNTER ping-action = ACTION_PING
PERFORMANCE_COUNTER pong-action = ACTION_PING_REPLY
PERFORMANCE_COUNTER node-count = 256
PERFORMANCE_COUNTER worker-count-per-node = 22
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 5632
PERFORMANCE_COUNTER actor-count = 563200
PERFORMANCE_COUNTER ping-message-count-per-actor = 40000
PERFORMANCE_COUNTER ping-message-count = 22528000000
PERFORMANCE_COUNTER pong-message-count = 22528000000
PERFORMANCE_COUNTER message-count = 45056000000
PERFORMANCE_COUNTER elapsed-time = 852.583325 s
PERFORMANCE_COUNTER computation-throughput = 52846447.563432 messages / s
PERFORMANCE_COUNTER node-throughput = 206431.435795 messages / s
PERFORMANCE_COUNTER worker-throughput = 9383.247082 messages / s
PERFORMANCE_COUNTER worker-latency = 106572 ns
PERFORMANCE_COUNTER actor-throughput = 93.832471 messages / s
PERFORMANCE_COUNTER actor-latency = 10657291 ns

Submitted build latency_probe-2014-11-06-14-08-21 (a0f27c733a825cc336c99ce8e3c970f140155267) 2906410.sdb

Beagle) grep COUNTER latency_probe-2014-11-06-14-08-21.stdout
PERFORMANCE_COUNTER type = ping-pong
PERFORMANCE_COUNTER ping-action = ACTION_PING
PERFORMANCE_COUNTER pong-action = ACTION_PING_REPLY
PERFORMANCE_COUNTER node-count = 256
PERFORMANCE_COUNTER worker-count-per-node = 22
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 5632
PERFORMANCE_COUNTER actor-count = 563200
PERFORMANCE_COUNTER ping-message-count-per-actor = 40000
PERFORMANCE_COUNTER ping-message-count = 22528000000
PERFORMANCE_COUNTER pong-message-count = 22528000000
PERFORMANCE_COUNTER message-count = 45056000000
PERFORMANCE_COUNTER elapsed-time = 859.137648 s
PERFORMANCE_COUNTER computation-throughput = 52443284.386737 messages / s
PERFORMANCE_COUNTER node-throughput = 204856.579636 messages / s
PERFORMANCE_COUNTER worker-throughput = 9311.662711 messages / s
PERFORMANCE_COUNTER worker-latency = 107392 ns
PERFORMANCE_COUNTER actor-throughput = 93.116627 messages / s
PERFORMANCE_COUNTER actor-latency = 10739220 ns
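The derived counters in the latency_probe output follow from the raw counts and the elapsed time. The sketch below recomputes them for the first run (2906408.sdb). It is an illustration, not the code used by latency_probe, and it assumes that each latency is reported as the inverse of the corresponding throughput, in nanoseconds.

```python
# Raw counters copied from the first latency_probe run in this thread.
message_count = 45_056_000_000
elapsed_s = 852.583325
node_count = 256
workers_per_node = 22
actors_per_worker = 100

worker_count = node_count * workers_per_node
actor_count = worker_count * actors_per_worker

# Throughput at each level, in messages per second.
throughput = message_count / elapsed_s
node_throughput = throughput / node_count
worker_throughput = throughput / worker_count
actor_throughput = throughput / actor_count

# Assumption: latency = 1 / throughput, expressed in nanoseconds.
worker_latency_ns = 1e9 / worker_throughput
actor_latency_ns = 1e9 / actor_throughput

print(f"computation-throughput = {throughput:.2f} messages / s")
print(f"node-throughput        = {node_throughput:.2f} messages / s")
print(f"worker-throughput      = {worker_throughput:.2f} messages / s")
print(f"worker-latency         = {worker_latency_ns:.0f} ns")
print(f"actor-latency          = {actor_latency_ns:.0f} ns")
```

Under that assumption the recomputed values agree with the reported counters to within rounding.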

sebhtml commented 9 years ago

Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-06-21-17-41 (9e1e51e0dedcb45d65eb0afcaafa882680dcf0a1) 2906743.sdb

Result:

Beagle) grep TIMER spate-2014-11-06-21-17-41.stdout
TIMER [Load input / Count input data] 4 minutes, 3.643066 seconds
TIMER [Load input / Distribute input data] 4 minutes, 5.572739 seconds
TIMER [Load input] 8 minutes, 9.215820 seconds
TIMER [Build assembly graph / Distribute vertices] 13 minutes, 49.967224 seconds

sebhtml commented 9 years ago

with CONFIG_DEBUG=n:

Beagle) pwd
/lustre/beagle/CompBIO/biosal-THOR/biosal
Beagle) git diff
Beagle) pwd
/lustre/beagle/CompBIO/biosal-THOR/biosal
Beagle) git diff|cat
diff --git a/scripts/Cray_XE6/build-gnu.sh b/scripts/Cray_XE6/build-gnu.sh
index 2eb64d4..796f9f9 100755
--- a/scripts/Cray_XE6/build-gnu.sh
+++ b/scripts/Cray_XE6/build-gnu.sh
@@ -23,6 +23,6 @@ module load xpmem/0.1-2.0402.44035.2.1.gem
 module load udreg/2.3.2-1.0402.7311.2.1.gem

 make clean
-make CC=cc -j 4 applications/argonnite_kmer_counter/argonnite CONFIG_DEBUG=y \
+make CC=cc -j 4 applications/argonnite_kmer_counter/argonnite \
     applications/spate_metagenome_assembler/spate \
     performance/latency_probe/latency_probe

Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-09-34-59 (5a0882af531a44830c9982d25ee672b9793c4788) 2907275.sdb

Result: the launch script pulls from GitHub, so the local patch had no effect.

sebhtml commented 9 years ago

without CONFIG_DEBUG

Submitted build spate-2014-11-07-09-58-19 (748e42d5e043963ce07b307548db81783058dd1f) 2907299.sdb

verification:
Beagle) ./spate-2014-11-07-09-58-19.spate -transport mock_transport|grep CONFIG_DEBUG|wc -l
0

Beagle) grep TIMER spate-2014-11-07-09-58-19.stdout.2
TIMER [Load input / Count input data] 3 minutes, 39.050400 seconds
TIMER [Load input / Distribute input data] 3 minutes, 30.601685 seconds
TIMER [Load input] 7 minutes, 9.652100 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 21.558533 seconds

The running time of the earlier runs was higher because of the new assertions enabled by CONFIG_DEBUG.

There were 2 problems:

memory usage:

Beagle) grep " ByteCount" spate-2014-11-07-09-58-19.stdout.2|tail
thorium_node: node/167 METRICS AliveActorCount: 89 ByteCount: 50011672576 / 33877495808
thorium_node: node/86 METRICS AliveActorCount: 89 ByteCount: 49684054016 / 33877495808
thorium_node: node/153 METRICS AliveActorCount: 89 ByteCount: 48694865920 / 33877495808
thorium_node: node/140 METRICS AliveActorCount: 89 ByteCount: 49917792256 / 33877495808
thorium_node: node/247 METRICS AliveActorCount: 89 ByteCount: 49909428224 / 33877495808
thorium_node: node/248 METRICS AliveActorCount: 89 ByteCount: 50654879744 / 33877495808
thorium_node: node/244 METRICS AliveActorCount: 89 ByteCount: 49980608512 / 33877495808
thorium_node: node/74 METRICS AliveActorCount: 89 ByteCount: 50292686848 / 33877495808
thorium_node: node/150 METRICS AliveActorCount: 89 ByteCount: 50387423232 / 33877495808
thorium_node: node/12 METRICS AliveActorCount: 89 ByteCount: 50047279104 / 33877495808
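The METRICS lines can be scanned for nodes whose ByteCount exceeds the second number on the line. This is a minimal sketch with a hypothetical helper name (`memory_ratios`). It assumes the second number is the memory capacity of the node in bytes, which the thread does not state explicitly.

```python
import re

# Matches: node/167 METRICS AliveActorCount: 89 ByteCount: 50011672576 / 33877495808
METRICS_RE = re.compile(
    r"node/(\d+) METRICS AliveActorCount: (\d+) ByteCount: (\d+) / (\d+)"
)


def memory_ratios(text):
    """Map node id -> ByteCount divided by the second number (assumed capacity)."""
    return {
        int(node): int(used) / int(total)
        for node, _actors, used, total in METRICS_RE.findall(text)
    }


# Two lines copied from the output above.
sample = (
    "thorium_node: node/167 METRICS AliveActorCount: 89 ByteCount: 50011672576 / 33877495808\n"
    "thorium_node: node/153 METRICS AliveActorCount: 89 ByteCount: 48694865920 / 33877495808\n"
)

ratios = memory_ratios(sample)
over_limit = {node: ratio for node, ratio in ratios.items() if ratio > 1.0}
for node, ratio in sorted(over_limit.items()):
    print(f"node/{node}: {ratio:.2f}x of the assumed capacity")
```

For comparison, the run of 1cfbfc2 later in this thread reports 22811369472 / 33877495808, which is below 1.0.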

sebhtml commented 9 years ago

Submitted build spate-2014-11-07-11-23-28 (2d04562) 2907618.sdb

thorium_node: node/153 METRICS AliveActorCount: 89 ByteCount: 36071854080 / 33877495808

sebhtml commented 9 years ago

Beagle) export BUILD_COMMIT=d5d3cec
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-11-43-50 (d5d3cec) 2907621.sdb

Result: thorium_node: node/5 METRICS AliveActorCount: 89 ByteCount: 35510394880 / 33877495808

sebhtml commented 9 years ago

Beagle) export BUILD_COMMIT=72336f9
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-12-11-32 (72336f9) 2907622.sdb

thorium_node: node/247 METRICS AliveActorCount: 89 ByteCount: 41266630656 / 33877495808

sebhtml commented 9 years ago

Beagle) export BUILD_COMMIT=3cc35e1
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-12-31-45 (3cc35e1) 2907639.sdb

OOM, before getting all vertices

sebhtml commented 9 years ago

Verify the last known version to work:

manual build of 082485e

Beagle) qsub spate-2014-11-07-12-31-45.pbs
2907640.sdb

It is OK, waiting for TIMER to print time...

Beagle) grep TIMER spate-2014-11-07-12-31-45.stdout
TIMER [Load input / Count input data] 4 minutes, 3.089340 seconds
TIMER [Load input / Distribute input data] 3 minutes, 55.463501 seconds
TIMER [Load input] 7 minutes, 58.552856 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 49.959473 seconds
TIMER [Build assembly graph / Distribute arcs] 10 minutes, 43.182678 seconds
TIMER [Build assembly graph] 16 minutes, 33.142151 seconds

sebhtml commented 9 years ago

manual build -> 1cfbfc2
build manually
cp to build artifact path
submit

Beagle) qsub spate-2014-11-07-13-21-33.pbs
2907643.sdb

criterion: PASSED if arc extraction gets past 50%. (The memory leak brings down the whole thing at around 80%.)

result:
Beagle) tail spate-2014-11-07-13-21-33.stdout -n1
sequence store 1006325 has 147518/393216 (0.38) entries left to produce
Beagle) grep ByteCount spate-2014-11-07-13-21-33.stdout|tail -n1
thorium_node: node/15 METRICS AliveActorCount: 89 ByteCount: 22811369472 / 33877495808

sebhtml commented 9 years ago

manual submission (checkout, build, cp artifact) 153eea9 spate-2014-11-07-13-59-39

Beagle) qsub spate-2014-11-07-13-59-39.pbs 2907656.sdb

Beagle) grep ByteCount spate-2014-11-07-13-59-39.stdout |tail -n1
thorium_node: node/84 METRICS AliveActorCount: 89 ByteCount: 45073874944 / 33877495808

sebhtml commented 9 years ago

c55aa06 spate-2014-11-07-14-32-57

Beagle) qsub spate-2014-11-07-14-32-57.pbs
2907733.sdb

Beagle) grep TIMER spate-2014-11-07-14-32-57.stdout
TIMER [Load input / Count input data] 4 minutes, 0.524765 seconds
TIMER [Load input / Distribute input data] 4 minutes, 13.131958 seconds
TIMER [Load input] 8 minutes, 13.656738 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 46.259430 seconds
^C
Beagle) tail -n1 spate-2014-11-07-14-32-57.stdout
sequence store 1004066 has 127780/393216 (0.32) entries left to produce

sebhtml commented 9 years ago

The regression is caused by

153eea9 thorium_node: use message type to select pool to free buffer
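The conclusion follows from the PASSED / FAILED labels collected above. The sketch below illustrates the reasoning. It is not a tool used in this thread, `first_bad` is a hypothetical name, and it assumes that the commit list is printed newest first (as in `git log`), so it is reversed here to run oldest first.

```python
# Test outcomes from this thread, oldest commit first.
# True = PASSED, False = FAILED, None = not tested or inconclusive.
history = [
    ("082485e", True),
    ("5469856", None),   # NOT_LIKELY
    ("02f617d", None),   # NOT_LIKELY
    ("48be536", None),   # NOT_LIKELY
    ("1cfbfc2", True),
    ("e463f70", None),
    ("3cc35e1", None),   # DIFFERENT_PROBLEM
    ("c55aa06", True),
    ("153eea9", False),
]


def first_bad(history):
    """Return (last good commit, first bad commit, untested commits in between)."""
    last_good = None
    untested = []
    for commit, passed in history:
        if passed is True:
            last_good, untested = commit, []
        elif passed is False:
            return last_good, commit, untested
        else:
            untested.append(commit)
    return last_good, None, untested


last_good, culprit, untested = first_bad(history)
print(f"last good: {last_good}, first bad: {culprit}, untested in between: {untested}")
```

Since no untested commit lies between c55aa06 and 153eea9, the regression can be attributed to 153eea9 alone.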

sebhtml commented 9 years ago

[boisvert@bigmem biosal]$ git show --stat 153eea9
commit 153eea9475de41fad944bb3cfccd233091f8d344
Author: Sébastien Boisvert <boisvert@anl.gov>
Date:   Sun Nov 2 21:38:27 2014 -0500

    thorium_node: use message type to select pool to free buffer

    Signed-off-by: Sébastien Boisvert <boisvert@anl.gov>

 core/system/memory_pool.c           | 30 +++++++++++++++++++++++++-----
 core/system/memory_pool.h           |  2 +-
 engine/thorium/message.c            |  8 +++++---
 engine/thorium/node.c               | 15 +++++++++++++--
 performance/latency_probe/process.c |  1 -
 5 files changed, 44 insertions(+), 12 deletions(-)

sebhtml commented 9 years ago

before closing:

sebhtml commented 9 years ago

JobName spate-2014-11-07-15-21-34

Goal Verify if patch fixes memory leak

Machine Beagle

AllocationStatus

Path /lustre/beagle/CompBIO/automated-tests

Commit df97d4a745d8d2696fb786646a0d995c97b9942d

Toolchain
Beagle) cc --version
/opt/cray/xt-asyncpe/5.22/bin/cc: INFO: Compiling with CRAYPE_COMPILE_TARGET=native.
gcc (GCC) 4.8.1 20130531 (Cray Inc.)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Script
Beagle) cat spate-2014-11-07-15-21-34.pbs
#!/bin/bash
#PBS -N spate-2014-11-07-15-21-34
#PBS -A CI-DEB000002
#PBS -l walltime=1:00:00
#PBS -l mppwidth=6144

cd $PBS_O_WORKDIR
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
export CRAY_MALLOPT_OFF=1

echo "Commit= df97d4a745d8d2696fb786646a0d995c97b9942d"

aprun -n 256 -N 1 -d 23 -r 1 \
    spate-2014-11-07-15-21-34.spate -threads-per-node 23 -print-load \
    -k 43 Iowa_Continuous_Corn/*.fastq -o spate-2014-11-07-15-21-34 > spate-2014-11-07-15-21-34.stdout

Submission
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-15-21-34 (df97d4a745d8d2696fb786646a0d995c97b9942d) 2908724.sdb

MachineUtilization

ComputationLoad

RunningTime
Beagle) grep TIMER spate-2014-11-07-15-21-34.stdout
TIMER [Load input / Count input data] 4 minutes, 1.086914 seconds
TIMER [Load input / Distribute input data] 4 minutes, 7.503860 seconds
TIMER [Load input] 8 minutes, 8.590790 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 30.568176 seconds
TIMER [Build assembly graph / Distribute arcs] 9 minutes, 59.845520 seconds
TIMER [Build assembly graph] 15 minutes, 30.413696 seconds

MemoryUtilization

Checksum

GoodComments

BadComments

NeutralComments

sebhtml commented 9 years ago

26f0780 tests: add many unit tests for the memory pool test suite
4b20e77 core: improve function that detects memory leaks in memory pool
1fe70a3 tests: enable unit test for memory pool
c6b3424 core: fix function that checks memory leaks in memory pool

sebhtml commented 9 years ago

8 hour job:

Submitted build spate-2014-11-07-20-43-15 (26f0780ce81f9ad448a9e736301784bc828596b5)

Beagle) qsub spate-2014-11-07-20-43-15.pbs
2908905.sdb

Beagle) grep ,walltime spate-2014-11-07-20-43-15.o2908905

cput=00:25:50,mem=6040kb,vmem=131176kb,walltime=08:00:22

Beagle) grep TIMER spate-2014-11-07-20-43-15.stdout
TIMER [Load input / Count input data] 4 minutes, 1.845490 seconds
TIMER [Load input / Distribute input data] 4 minutes, 49.082275 seconds
TIMER [Load input] 8 minutes, 50.927734 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutescore_manager/1021181 dies
TIMER [Build assembly graph / Distribute arcs] 9 minutes, 55.694824 seconds
TIMER [Build assembly graph] 15 minutes, 21.868958 seconds

DEBUG the system has 2252800 visitors

calculation:
irb(main):003:0> 6144/24.0*22*400
=> 2252800.0
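The visitor count can be reproduced with named factors. The thread does not label the factors, so the names below are assumptions: 6144 is taken as the `mppwidth` core count from the PBS script, 24 as the number of cores per node, 22 as the number of workers per node, and 400 as the number of visitors per worker.

```python
# Factor names are assumptions; only the numbers come from the thread.
cores_requested = 6144      # PBS -l mppwidth=6144
cores_per_node = 24
workers_per_node = 22
visitors_per_worker = 400

nodes = cores_requested // cores_per_node
visitors = nodes * workers_per_node * visitors_per_worker

print(f"{nodes} nodes, {visitors} visitors")
```

With 100 visitors per worker instead of 400, the same product gives the 563200 visitors reported by the later run.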

thorium_worker_pool: node/130 EPOCH LOAD 1470 s 2.49/22 (0.11) 0.13 0.13 0.00 0.12 0.12 0.13 0.12 0.12 0.05 0.04 0.14 0.12 0.13 0.12 0.13 0.13 0.13 0.12 0.13 0.13 0.13 0.13
biosal_unitig_visitor/3084674 is ready to visit places in the universe
thorium_worker_pool: node/130 EPOCH FUTURE_TIMELINE 1470 s 110 105 114 115 123 185 183 214 195 204 226 212 222 175 207 212 223 195 185 185 217 221

at "1460 s", graph is ready.

job ends at "28817 s"

sebhtml commented 9 years ago

Submitted build spate-2014-11-09-10-45-41 (e2513e10d0d83bccb41c21ca04a8c43d6295ac1a) 2909970.sdb

Each visitor must visit around 25000 vertices:

irb(main):002:0> (141189180698 - 111416442656)/2/256/22/100
=> 26431

Beagle) grep TIMER spate-2014-11-09-10-45-41.stdout
TIMER [Load input / Count input data] 3 minutes, 30.649719 seconds
TIMER [Load input / Distribute input data] 3 minutes, 27.234467 seconds
TIMER [Load input] 6 minutes, 57.884186 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 40.622131 seconds
TIMER [Build assembly graph / Distribute arcs] 10 minutes, 4.028320 seconds
TIMER [Build assembly graph] 15 minutes, 44.650452 seconds

Beagle) tail spate-2014-11-09-10-45-41.stdout
biosal_unitig_visitor/1467767 visited 7500 vertices so far (velocity: 3.366248 vertices / s)
biosal_unitig_visitor/1240886 visited 10500 vertices so far (velocity: 4.706409 vertices / s)
biosal_unitig_visitor/1445941 visited 10500 vertices so far (velocity: 4.712747 vertices / s)
biosal_unitig_visitor/1554232 visited 7500 vertices so far (velocity: 3.364738 vertices / s)
biosal_unitig_visitor/1327208 visited 9000 vertices so far (velocity: 4.037685 vertices / s)
biosal_unitig_visitor/1365092 visited 8000 vertices so far (velocity: 3.587444 vertices / s)
biosal_unitig_visitor/1321576 visited 7500 vertices so far (velocity: 3.364738 vertices / s)
biosal_unitig_visitor/1148652 visited 6500 vertices so far (velocity: 2.914798 vertices / s)
biosal_unitig_visitor/1543035 visited 10500 vertices so far (velocity: 4.706409 vertices / s)
biosal_unitig_visitor/1563851 visited 9500 vertices so far (velocity: 4.263914 vertices / s)

velocity is not that great...

DEBUG the system has 563200 visitors

sebhtml commented 9 years ago

JobName

Goal 4-hour job: it should visit all the vertices in < 2 hours

Machine Beagle

AllocationStatus

Path /lustre/beagle2/CompBIO/automated-tests

Commit e2513e1

Toolchain

Script
Beagle) cat spate-2014-11-09-10-45-41-2.pbs
#!/bin/bash
#PBS -N spate-2014-11-09-10-45-41-2
#PBS -A CI-DEB000002
#PBS -l walltime=4:00:00
#PBS -l mppwidth=6144

cd $PBS_O_WORKDIR
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
export CRAY_MALLOPT_OFF=1

echo "Commit= e2513e10d0d83bccb41c21ca04a8c43d6295ac1a"

aprun -n 256 -N 1 -d 23 -r 1 \
    spate-2014-11-09-10-45-41-2.spate -threads-per-node 23 -print-load \
    -k 43 Iowa_Continuous_Corn/*.fastq -o spate-2014-11-09-10-45-41-2 > spate-2014-11-09-10-45-41-2.stdout

Submission
Beagle) qsub spate-2014-11-09-10-45-41-2.pbs
2910029.sdb

MachineUtilization

ComputationLoad

RunningTime
Beagle) grep TIMER spate-2014-11-09-10-45-41-2.stdout
TIMER [Load input / Count input data] 3 minutes, 30.913208 seconds
TIMER [Load input / Distribute input data] 3 minutes, 29.836578 seconds
TIMER [Load input] 7 minutes, 0.749786 seconds
DEBUG_POOL Name= 0xa156d064 Self=0x2aab8bd1b7a8 Result= - ATIMER [Build assembly graph / Distribute vertices] 5 minutes, 36.316711 seconds
TIMER [Build assembly graph / Distribute arcs] 9 minutes, 53.918091 seconds
TIMER [Build assembly graph] 15 minutes, 30.234802 seconds
TIMER [Visit vertices for unitigs] 61 minutes, 46.888428 seconds
TIMER [Walk for unitigs] 23 minutes, 52.694702 seconds
TIMER [Total] 108 minutes, 31.804199 seconds

MemoryUtilization

Checksum

GoodComments

BadComments

NeutralComments

Multiplexing rate

Beagle) tail spate-2014-11-09-10-45-41-2.stdout
thorium_message_multiplexer: original_message_count 65692338 real_message_count 39290436 (0.5981)
thorium_message_multiplexer: original_message_count 56643081 real_message_count 37397266 (0.6602)
thorium_message_multiplexer: original_message_count 60924411 real_message_count 37200669 (0.6106)
thorium_message_multiplexer: original_message_count 62630445 real_message_count 38497287 (0.6147)
thorium_message_multiplexer: original_message_count 63264512 real_message_count 37680505 (0.5956)
thorium_message_multiplexer: original_message_count 62402649 real_message_count 38596478 (0.6185)
thorium_message_multiplexer: original_message_count 63499178 real_message_count 38680350 (0.6091)
thorium_message_multiplexer: original_message_count 64174211 real_message_count 39051161 (0.6085)
thorium_message_multiplexer: original_message_count 61591127 real_message_count 36873615 (0.5987)
Application 5009918 resources: utime ~29758391s, stime ~8402871s, Rss ~24108432, inblocks ~2003228137, outblocks ~16703651
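The number in parentheses on each multiplexer line is the ratio of the two counts. The sketch below recomputes it. The helper name `multiplexing_ratios` is hypothetical, and the reading that `real_message_count` is the number of messages actually sent after multiplexing is an assumption.

```python
import re

# Matches: original_message_count 65692338 real_message_count 39290436
MUX_RE = re.compile(r"original_message_count (\d+) real_message_count (\d+)")


def multiplexing_ratios(text):
    """Return real_message_count / original_message_count for each entry in text."""
    return [int(real) / int(original) for original, real in MUX_RE.findall(text)]


# Two lines copied from the output above.
sample = (
    "thorium_message_multiplexer: original_message_count 65692338 real_message_count 39290436 (0.5981)\n"
    "thorium_message_multiplexer: original_message_count 56643081 real_message_count 37397266 (0.6602)\n"
)

ratios = multiplexing_ratios(sample)
for ratio in ratios:
    # Under the assumption above, 1 - ratio is the fraction of messages saved.
    print(f"ratio {ratio:.4f}, saved {1 - ratio:.1%}")
```

Under that reading, a ratio near 0.6 means that multiplexing removed roughly 40% of the messages on these nodes.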

There was one multiplexer subsystem per node, not per worker.

29772738042 vertices with a coverage of at least 2

biosal_unitig_visitor/1473433 visited 4500 vertices so far (velocity: 9.740260 vertices / s)
biosal_unitig_visitor/1473433 visited 5000 vertices so far (velocity: 9.842520 vertices / s)
biosal_unitig_visitor/1473433 visited 5500 vertices so far (velocity: 9.874327 vertices / s)
biosal_unitig_visitor/1473433 visited 6000 vertices so far (velocity: 9.771987 vertices / s)
biosal_unitig_visitor/1473433 visited 6500 vertices so far (velocity: 9.643917 vertices / s)
biosal_unitig_visitor/1473433 visited 7000 vertices so far (velocity: 9.735744 vertices / s)
biosal_unitig_visitor/1473433 visited 7500 vertices so far (velocity: 9.791122 vertices / s)
biosal_unitig_visitor/1473433 visited 8000 vertices so far (velocity: 9.828010 vertices / s)

irb(main):001:0> 141189180698 - 111416442656
=> 29772738042

predicted:

irb(main):005:0> 29772738042.0 / 2 / 256 / 22 / 100 / 10 / 60
=> 44.0529386274858

actual time for visitors: 61 minutes
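The prediction and the measured time can be put side by side. The sketch below repeats the irb calculation with named factors. The names are assumptions (the thread gives only the numbers), the division by 2 is copied from the irb expression without further interpretation, and the velocity of 10 vertices / s is the rounded value used in that expression.

```python
# Numbers from the thread; variable names are assumptions.
vertices = 141189180698 - 111416442656   # 29772738042
nodes = 256
workers_per_node = 22
visitors_per_worker = 100
velocity = 10.0                           # vertices / s per visitor, as in the irb expression

visitors = nodes * workers_per_node * visitors_per_worker
vertices_per_visitor = vertices / 2 / visitors
predicted_min = vertices_per_visitor / velocity / 60

# TIMER [Visit vertices for unitigs] 61 minutes, 46.888428 seconds
actual_min = 61 + 46.888428 / 60

ratio = actual_min / predicted_min
print(f"predicted {predicted_min:.1f} min, actual {actual_min:.1f} min ({ratio:.2f}x)")
```

The measured visit phase took about 1.4 times the predicted duration.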

sebhtml commented 9 years ago

This ticket is done, graph traversal was performed on Beagle.

Now the result must be improved.

sebhtml commented 9 years ago

TODO: fill QA report completely (velocity, data size, graph size, memory usage, LOAD, basically every part of the QA report)

update: wait for Beagle. It will come back from maintenance on December 1st, 2014.

levinas commented 9 years ago

Curious what the scaling results would be for this one. What's the smallest number of Beagle nodes that would have enough memory to run this job?

sebhtml commented 9 years ago

I would say around 140.

Memory usage sits at around 22-25 GiB per node, but about 3.5 GiB is already in use before the job starts.

levinas commented 9 years ago

Is the iowa-cc dataset around 250GB? That means the minimum system memory to input data size ratio is 140*32/250 = 18?

sebhtml commented 9 years ago

It is 450 GiB on disk, uncompressed. Once I fill the report, we will be able to do these estimates (Beagle is not online right now).
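The ratio asked about above can be recomputed for both input sizes mentioned in this exchange. This is only a rough sketch: `memory_to_input_ratio` is a hypothetical helper, the 140 nodes and 32 GB per node are the figures quoted in the comments, and GB and GiB are treated as interchangeable.

```python
def memory_to_input_ratio(nodes, memory_per_node, input_size):
    """Aggregate memory of the allocation divided by the input size (same unit for both)."""
    return nodes * memory_per_node / input_size


nodes = 140            # estimated minimum node count from the reply above
memory_per_node = 32   # per-node memory figure used in the question

for label, input_size in (("250 (size guessed in the question)", 250),
                          ("450 (uncompressed size on disk)", 450)):
    ratio = memory_to_input_ratio(nodes, memory_per_node, input_size)
    print(f"input size {label}: ratio {ratio:.1f}")
```

With the 450 GiB figure the ratio drops from about 18 to about 10.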

sebhtml commented 9 years ago

Beagle will come back online on December 1st, 2014.