Closed sebhtml closed 9 years ago
Last good one (#751):
Beagle) grep TIMER spate-2014-11-02-12-27-34.stdout
(082485e351f78990c3c13a7c3f2a2df84a3d7856)
TIMER [Load input / Count input data] 3 minutes, 37.046280 seconds
TIMER [Load input / Distribute input data] 3 minutes, 21.185928 seconds
TIMER [Load input] 6 minutes, 58.232208 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 48.143402 seconds
TIMER [Build assembly graph / Distribute arcs] 11 minutes, 2.775940 seconds
TIMER [Build assembly graph] 16 minutes, 50.919373 seconds

Beagle) head -n3 spate-2014-11-02-12-27-34.stdout
thorium_transport: type mpi1_pt2pt_transport
thorium_scheduler: type cfs_scheduler
thorium_message_multiplexer: disabled=0 buffer_size_in_bytes=1024 timeout_in_nanoseconds=200000
Changes between 082485e351f78990c3c13a7c3f2a2df84a3d7856 and 07d6f14:
FAILED SCHEDULED 748e42d cray: don't enable CONFIG_DEBUG by default
5a0882a core: examine memory pool for leak
903ee38 latency_probe: use a random start and round-robin selection
8fb4156 latency_probe: display number of send and received messages at the end
9e1e51e latency_probe: use source actor script
031a2dd latency_probe: add source actor script
575f3bb latency_probe: add target actors
fa78911 latency_probe: add target actor script
eb99e13 thorium_worker: simplify random seed
a1380de thorium_node: fix regression added in 47a651c857
91d2aff thorium_worker: use a better random seed
a0f27c7 thorium: fix regression added in 89300f4ee regarding random numbers
89300f4 thorium: use rand_r() inside worker whenever possible
e9b2e9b ring: always use a memory fence before incrementing the tail
9daab5d thorium_worker: push message directly in ring when possible
1ee0739 thorium_worker: put message directly in outbound message ring
f9fb696 thorium: rename function to thorium_worker_send_local_delivery
077e59e scripts: don't search for action specifiers defined with a BASE
389ce63 thorium_worker: add function to send message to other nodes
5817252 thorium_worker: add function to send with multiplexer
e2bf6c7 performance: limit the number of targets to reduce memory usage
47a651c thorium: assign an actor to a worker when it is spawned
5680d06 thorium: add a function to get the worker name of an actor
655a0be thorium: call rand_r() in actor instead of in worker
6fe3f02 Merge branch 'energy' of github.com:sebhtml/biosal into energy
9621e0d tests: use CORE_DEBUGGER_ASSERT_ENABLED for complex assertions
bfc1d16 tests: don't use rand() in tests, use rand_r() instead
3ff5509 performance: avoid the function rand() in receive()
4cbbb4b examples: don't use rand() in actors
de2c30c genomics: don't use rand() inside actors
2b7a5b7 thorium: don't use rand() to avoid system calls (sys_futex)
FAILED SCHEDULED 07d6f14 tests: fix Beagle launch script for Spate
9006075 tracing: fix LTTng option
SUSPICIOUS 8e0448d thorium: change action naming to follow Minix-style convention
f7ca988 performance: avoid having too many I/O syscalls
c060534 latency_probe: display number of reply messages
87ecc97 thorium: tests show that Iprobe/Recv is better than Irecv/Test
1a32240 thorium: don't disable the multiplexer on BGQ
2c7df75 tests: add prefix in copy command for the build artifact
f53e34a tests: use good executable for latency_probe test
f36d2f0 core: fix compilation error on Blue Gene/Q
07353e8 core: POWER7 has a relaxed (weak) memory model
SUSPICIOUS 33acaeb core: make core_memory_fence() not 'static inline' for profiling
525603c core: simplify fast_ring interface
6942260 core: move spinlock directly in the critical section function
b0b038e performance: fix option in latency_probe
46fde8e scripts: the argument -I. is not needed anymore
898c053 core: use CONFIG_DEBUG and not THORIUM_DEBUG
5b74c66 tests: add launch script for latency_probe on Blue Gene/Q
75a0d13 documentation: improve transport interface documentation
7b2b66a thorium: don't use a aloop in mpi1_pt2pt_transport_test
dbbb6aa thorium_transport: remove outdated constant
723785e thorium_transport: fix documentation of mock_transport
b4a407d mpi1_pt2pt_nonblocking: don't use a loop in test()
3545aa3 thorium_transport: use PT2PT and not P2P
9b5a030 thorium: don't use a loop in mpi1_pt2pt_nonblocking receive()
dfb3879 thorium: set mpi1_pt2pt_nonblocking as default transport subsystem
131e947 documentation: improve compilation option documentation
fdc1296 documentation: fix compilation option table
d981e81 documentation: fix error in readme
89b2a93 documentation: add more compilation options
5fd62d4 documentation: add compilation option in documentation
0330965 thorium: fix memory leak in the demultiplexing code path
80d7c1a build: don't use MPI on Xeon Phi for tests
5c3a601 performance: also include ACTION_PING_REPLY messages in results
c541b41 intel: add script to build on Intel Xeon Phi 7120A
FAILED SCHEDULED 2d04562 build: add option CONFIG_CLOCK_GETTIME=n
15e03f5 build: fix minimal build
7c2dda5 build: improve the option CONFIG_LTTNG
c190bdf build: change THORIUM_DEBUG to CONFIG_DEBUG
2a0ef48 thorium_transport: add fallback transport that does nothing (mock)
1f71cd3 thorium: add compilation option CONFIG_MPI=n to disable MPI
bedee9b documentation: add documentation for CONFIG_LTTNG
07914ff build: add compilation option CONFIG_LTTNG
e41c37c build: rename CONFIG_FLAGS to CONFIG_CFLAGS
b6c3a1d build: add option to disable support for zlib (CONFIG_ZLIB=n)
9e04b23 thorium: add CONFIG_LDFLAGS for optional LDFLAGS
6685b2f core: add independent makefile for file storage
6d528c9 thorium: add independent makefile for schedulers
9031f41 thorium: use independent make file for transport
ee14fd7 thorium: set default value for CONFIG_PAMI=n
7ee9129 performance: fix 4x7 script for latency_probe tests
907d236 thorium: remove empty lines in multiplexer
FAILED SCHEDULED d5d3cec performance: don't show worker count in latency_probe
f9c0ce4 thorium_node: show debug mode in output
3f8852e tests: add more performance tests with latency_probe
SUSPICIOUS 5b3504b core: fix a bug in incremental resizing of hash table
766998c performance: add 2 script to measure throughput on multi-core
ad37b83 engine: use rand_r (scalable) instead of rand (protected)
d9f752a tests: add some tests in the test_map_delete suite
3d18728 tests: add a test for deleting from a map
b948cf9 core: add assertions in memory pool to track a bug
FAILED SCHEDULED 72336f9 thorium: print type of message in message_print
c6730a1 core: verify if pointer is managed by pool if tracking is enabled
BLOCK_START
FAILED SCHEDULED SUSPICIOUS 153eea9 thorium_node: use message type to select pool to free buffer
PASSED SCHEDULED c55aa06 transport: disable nonblocking communication owing to a bug
DIFFERENT_PROBLEM SCHEDULED 3cc35e1 transport: a MPI message can have 0 bytes (this is allowed)
e463f70 transport: use nonblocking transport by default
PASSED SCHEDULED 1cfbfc2 tests: generate unique test artifact for spate automated tests
BLOCK_END
NOT_LIKELY 48be536 test: use unique executable artefact for latency_probe test
NOT_LIKELY 02f617d test: fix the path to latency_probe
NOT_LIKELY 5469856 tests: first generation for latency_probe on Beagle
PASSED SCHEDULED 082485e thorium: make sure that transport requests don't accumulate
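The PASSED/FAILED annotations above amount to a manual bisection between the last good build (082485e) and the failing ones. A minimal sketch of that search follows, assuming a single good-to-bad transition. The `first_bad_commit` helper and the `leak_free` oracle are hypothetical stand-ins: in the ticket, each verdict came from a full Spate run on Beagle, and 3cc35e1 failed for a different reason that this sketch ignores.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Binary search for the first failing commit.

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. Assumes exactly one good-to-bad transition.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Last good build plus the BLOCK_START..BLOCK_END region, oldest first.
COMMITS = ["082485e", "5469856", "02f617d", "48be536", "1cfbfc2",
           "e463f70", "3cc35e1", "c55aa06", "153eea9"]


def leak_free(commit: str) -> bool:
    # Hypothetical oracle: stands in for "submit a job and check ByteCount".
    # Only 153eea9 is treated as showing the leak (simplification).
    return commit != "153eea9"


print(first_bad_commit(COMMITS, leak_free))
```

With nine candidates, the search needs three job submissions instead of one per commit, which matters when each run takes tens of minutes on 256 nodes.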
with new changes:
Submitted build spate-2014-11-06-13-37-43 (89300f4ee6d4cddcaad6c633b980620df32d4981) 2906385.sdb
result: deleted.
latency_probe with new changes:
Submitted build latency_probe-2014-11-06-13-50-55 (a0f27c733a825cc336c99ce8e3c970f140155267) 2906408.sdb
Beagle) grep COUNTER latency_probe-2014-11-06-13-50-55.stdout
PERFORMANCE_COUNTER type = ping-pong
PERFORMANCE_COUNTER ping-action = ACTION_PING
PERFORMANCE_COUNTER pong-action = ACTION_PING_REPLY
PERFORMANCE_COUNTER node-count = 256
PERFORMANCE_COUNTER worker-count-per-node = 22
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 5632
PERFORMANCE_COUNTER actor-count = 563200
PERFORMANCE_COUNTER ping-message-count-per-actor = 40000
PERFORMANCE_COUNTER ping-message-count = 22528000000
PERFORMANCE_COUNTER pong-message-count = 22528000000
PERFORMANCE_COUNTER message-count = 45056000000
PERFORMANCE_COUNTER elapsed-time = 852.583325 s
PERFORMANCE_COUNTER computation-throughput = 52846447.563432 messages / s
PERFORMANCE_COUNTER node-throughput = 206431.435795 messages / s
PERFORMANCE_COUNTER worker-throughput = 9383.247082 messages / s
PERFORMANCE_COUNTER worker-latency = 106572 ns
PERFORMANCE_COUNTER actor-throughput = 93.832471 messages / s
PERFORMANCE_COUNTER actor-latency = 10657291 ns

Submitted build latency_probe-2014-11-06-14-08-21 (a0f27c733a825cc336c99ce8e3c970f140155267) 2906410.sdb
Beagle) grep COUNTER latency_probe-2014-11-06-14-08-21.stdout
PERFORMANCE_COUNTER type = ping-pong
PERFORMANCE_COUNTER ping-action = ACTION_PING
PERFORMANCE_COUNTER pong-action = ACTION_PING_REPLY
PERFORMANCE_COUNTER node-count = 256
PERFORMANCE_COUNTER worker-count-per-node = 22
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 5632
PERFORMANCE_COUNTER actor-count = 563200
PERFORMANCE_COUNTER ping-message-count-per-actor = 40000
PERFORMANCE_COUNTER ping-message-count = 22528000000
PERFORMANCE_COUNTER pong-message-count = 22528000000
PERFORMANCE_COUNTER message-count = 45056000000
PERFORMANCE_COUNTER elapsed-time = 859.137648 s
PERFORMANCE_COUNTER computation-throughput = 52443284.386737 messages / s
PERFORMANCE_COUNTER node-throughput = 204856.579636 messages / s
PERFORMANCE_COUNTER worker-throughput = 9311.662711 messages / s
PERFORMANCE_COUNTER worker-latency = 107392 ns
PERFORMANCE_COUNTER actor-throughput = 93.116627 messages / s
PERFORMANCE_COUNTER actor-latency = 10739220 ns
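The derived counters in both runs can be reproduced from message-count, elapsed-time and the node, worker and actor counts. The sketch below shows that arithmetic. It assumes that each latency is the inverse of the matching throughput, truncated to whole nanoseconds; that relation is inferred from the printed numbers, not taken from the latency_probe source. `derive_counters` is a hypothetical helper name.

```python
def derive_counters(message_count, elapsed_s, nodes, workers, actors):
    """Recompute the derived PERFORMANCE_COUNTER values from the raw ones."""
    computation = message_count / elapsed_s  # messages / s over the whole job
    worker = computation / workers
    actor = computation / actors
    return {
        "computation-throughput": computation,
        "node-throughput": computation / nodes,
        "worker-throughput": worker,
        # Assumption: latency is 1 / throughput, truncated to nanoseconds.
        "worker-latency-ns": int(1e9 / worker),
        "actor-throughput": actor,
        "actor-latency-ns": int(1e9 / actor),
    }


# Raw counters from the first run (latency_probe-2014-11-06-13-50-55).
first_run = derive_counters(45056000000, 852.583325, 256, 5632, 563200)
for name, value in first_run.items():
    print(name, value)
```

Under this reading, "worker-latency" is the average time between messages handled by one worker, not a measured round-trip time.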
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-06-21-17-41 (9e1e51e0dedcb45d65eb0afcaafa882680dcf0a1) 2906743.sdb
Result:
Beagle) grep TIMER spate-2014-11-06-21-17-41.stdout
TIMER [Load input / Count input data] 4 minutes, 3.643066 seconds
TIMER [Load input / Distribute input data] 4 minutes, 5.572739 seconds
TIMER [Load input] 8 minutes, 9.215820 seconds
TIMER [Build assembly graph / Distribute vertices] 13 minutes, 49.967224 seconds
with CONFIG_DEBUG=n:
Beagle) pwd
/lustre/beagle/CompBIO/biosal-THOR/biosal
Beagle) git diff
Beagle) pwd
/lustre/beagle/CompBIO/biosal-THOR/biosal
Beagle) git diff|cat
diff --git a/scripts/Cray_XE6/build-gnu.sh b/scripts/Cray_XE6/build-gnu.sh
index 2eb64d4..796f9f9 100755
--- a/scripts/Cray_XE6/build-gnu.sh
+++ b/scripts/Cray_XE6/build-gnu.sh
@@ -23,6 +23,6 @@ module load xpmem/0.1-2.0402.44035.2.1.gem
 module load udreg/2.3.2-1.0402.7311.2.1.gem

 make clean
-make CC=cc -j 4 applications/argonnite_kmer_counter/argonnite CONFIG_DEBUG=y \
+make CC=cc -j 4 applications/argonnite_kmer_counter/argonnite \
 applications/spate_metagenome_assembler/spate \
 performance/latency_probe/latency_probe
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-09-34-59 (5a0882af531a44830c9982d25ee672b9793c4788) 2907275.sdb
Result: the launch script pulls from GitHub, so the local patch had no effect.
without CONFIG_DEBUG
Submitted build spate-2014-11-07-09-58-19 (748e42d5e043963ce07b307548db81783058dd1f) 2907299.sdb
verification:
Beagle) ./spate-2014-11-07-09-58-19.spate -transport mock_transport|grep CONFIG_DEBUG|wc -l
0

Beagle) grep TIMER spate-2014-11-07-09-58-19.stdout.2
TIMER [Load input / Count input data] 3 minutes, 39.050400 seconds
TIMER [Load input / Distribute input data] 3 minutes, 30.601685 seconds
TIMER [Load input] 7 minutes, 9.652100 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 21.558533 seconds
The running time was higher due to new assertions.
There were 2 problems:
memory usage:
Beagle) grep " ByteCount" spate-2014-11-07-09-58-19.stdout.2|tail
thorium_node: node/167 METRICS AliveActorCount: 89 ByteCount: 50011672576 / 33877495808
thorium_node: node/86 METRICS AliveActorCount: 89 ByteCount: 49684054016 / 33877495808
thorium_node: node/153 METRICS AliveActorCount: 89 ByteCount: 48694865920 / 33877495808
thorium_node: node/140 METRICS AliveActorCount: 89 ByteCount: 49917792256 / 33877495808
thorium_node: node/247 METRICS AliveActorCount: 89 ByteCount: 49909428224 / 33877495808
thorium_node: node/248 METRICS AliveActorCount: 89 ByteCount: 50654879744 / 33877495808
thorium_node: node/244 METRICS AliveActorCount: 89 ByteCount: 49980608512 / 33877495808
thorium_node: node/74 METRICS AliveActorCount: 89 ByteCount: 50292686848 / 33877495808
thorium_node: node/150 METRICS AliveActorCount: 89 ByteCount: 50387423232 / 33877495808
thorium_node: node/12 METRICS AliveActorCount: 89 ByteCount: 50047279104 / 33877495808

Submitted build spate-2014-11-07-11-23-28 (2d04562) 2907618.sdb
thorium_node: node/153 METRICS AliveActorCount: 89 ByteCount: 36071854080 / 33877495808

Beagle) export BUILD_COMMIT=d5d3cec
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-11-43-50 (d5d3cec) 2907621.sdb
Result:
thorium_node: node/5 METRICS AliveActorCount: 89 ByteCount: 35510394880 / 33877495808

Beagle) export BUILD_COMMIT=72336f9
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-12-11-32 (72336f9) 2907622.sdb
thorium_node: node/247 METRICS AliveActorCount: 89 ByteCount: 41266630656 / 33877495808

Beagle) export BUILD_COMMIT=3cc35e1
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-12-31-45 (3cc35e1) 2907639.sdb
OOM, before getting all vertices
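The METRICS lines quoted above can be checked mechanically instead of by eye. Here is a small sketch that parses one line and reports the ratio between the two ByteCount figures. It assumes the first figure is the bytes in use on the node and the second is the memory available to it; the ticket does not define the fields, so treat that reading as an assumption. `parse_metrics` is a hypothetical helper.

```python
import re

# Matches the METRICS lines printed by thorium_node in the ticket.
METRICS_RE = re.compile(
    r"node/(\d+) METRICS AliveActorCount: (\d+) ByteCount: (\d+) / (\d+)"
)


def parse_metrics(line):
    """Return node id, actor count, byte counts and their ratio, or None."""
    match = METRICS_RE.search(line)
    if match is None:
        return None
    node, actors, used, total = map(int, match.groups())
    # Assumption: used / total > 1 means the node is over its memory budget.
    return {"node": node, "actors": actors, "used": used,
            "total": total, "ratio": used / total}


line = ("thorium_node: node/167 METRICS AliveActorCount: 89 "
        "ByteCount: 50011672576 / 33877495808")
info = parse_metrics(line)
print(info["node"], round(info["ratio"], 2))
```

A ratio well above 1.0, as in the failing builds, is consistent with the OOM kills reported in this ticket.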
Verify the last known version to work:
manual build of 082485e
Beagle) qsub spate-2014-11-07-12-31-45.pbs
2907640.sdb
It is OK, waiting for TIMER to print time...
Beagle) grep TIMER spate-2014-11-07-12-31-45.stdout
TIMER [Load input / Count input data] 4 minutes, 3.089340 seconds
TIMER [Load input / Distribute input data] 3 minutes, 55.463501 seconds
TIMER [Load input] 7 minutes, 58.552856 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 49.959473 seconds
TIMER [Build assembly graph / Distribute arcs] 10 minutes, 43.182678 seconds
TIMER [Build assembly graph] 16 minutes, 33.142151 seconds

manual build -> 1cfbfc2: build manually, cp to build artifact path, submit
Beagle) qsub spate-2014-11-07-13-21-33.pbs
2907643.sdb

criterion: PASSED if arc extraction gets past 50%. (The memory leak brings down the whole run at around 80%.)

result:
Beagle) tail spate-2014-11-07-13-21-33.stdout -n1
sequence store 1006325 has 147518/393216 (0.38) entries left to produce
Beagle) grep ByteCount spate-2014-11-07-13-21-33.stdout|tail -n1
thorium_node: node/15 METRICS AliveActorCount: 89 ByteCount: 22811369472 / 33877495808
manual submission (checkout, build, cp artifact): 153eea9, spate-2014-11-07-13-59-39
Beagle) qsub spate-2014-11-07-13-59-39.pbs
2907656.sdb
Beagle) grep ByteCount spate-2014-11-07-13-59-39.stdout |tail -n1
thorium_node: node/84 METRICS AliveActorCount: 89 ByteCount: 45073874944 / 33877495808
c55aa06, spate-2014-11-07-14-32-57
Beagle) qsub spate-2014-11-07-14-32-57.pbs
2907733.sdb

Beagle) grep TIMER spate-2014-11-07-14-32-57.stdout
TIMER [Load input / Count input data] 4 minutes, 0.524765 seconds
TIMER [Load input / Distribute input data] 4 minutes, 13.131958 seconds
TIMER [Load input] 8 minutes, 13.656738 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 46.259430 seconds
^C
Beagle) tail -n1 spate-2014-11-07-14-32-57.stdout
sequence store 1004066 has 127780/393216 (0.32) entries left to produce
The regression is caused by
153eea9 thorium_node: use message type to select pool to free buffer
[boisvert@bigmem biosal]$ git show --stat 153eea9
commit 153eea9475de41fad944bb3cfccd233091f8d344
Author: Sébastien Boisvert <boisvert@anl.gov>
Date: Sun Nov 2 21:38:27 2014 -0500

    thorium_node: use message type to select pool to free buffer

    Signed-off-by: Sébastien Boisvert <boisvert@anl.gov>

 core/system/memory_pool.c | 30 +++++++++++++++++++++++++-----
 core/system/memory_pool.h | 2 +-
 engine/thorium/message.c | 8 +++++---
 engine/thorium/node.c | 15 +++++++++++++--
 performance/latency_probe/process.c | 1 -
 5 files changed, 44 insertions(+), 12 deletions(-)
before closing:
JobName spate-2014-11-07-15-21-34
Goal Verify if patch fixes memory leak
Machine Beagle
AllocationStatus
Path /lustre/beagle/CompBIO/automated-tests
Commit df97d4a745d8d2696fb786646a0d995c97b9942d
Toolchain
Beagle) cc --version
/opt/cray/xt-asyncpe/5.22/bin/cc: INFO: Compiling with CRAYPE_COMPILE_TARGET=native.
gcc (GCC) 4.8.1 20130531 (Cray Inc.)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Script
Beagle) cat spate-2014-11-07-15-21-34.pbs
cd $PBS_O_WORKDIR
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
export CRAY_MALLOPT_OFF=1
echo "Commit= df97d4a745d8d2696fb786646a0d995c97b9942d"
aprun -n 256 -N 1 -d 23 -r 1 \
 spate-2014-11-07-15-21-34.spate -threads-per-node 23 -print-load \
 -k 43 Iowa_Continuous_Corn/*.fastq -o spate-2014-11-07-15-21-34 > spate-2014-11-07-15-21-34.stdout
Submission
Beagle) ./tests/Beagle_Cray_XE6/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-07-15-21-34 (df97d4a745d8d2696fb786646a0d995c97b9942d) 2908724.sdb
MachineUtilization
ComputationLoad
RunningTime
Beagle) grep TIMER spate-2014-11-07-15-21-34.stdout
TIMER [Load input / Count input data] 4 minutes, 1.086914 seconds
TIMER [Load input / Distribute input data] 4 minutes, 7.503860 seconds
TIMER [Load input] 8 minutes, 8.590790 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 30.568176 seconds
TIMER [Build assembly graph / Distribute arcs] 9 minutes, 59.845520 seconds
TIMER [Build assembly graph] 15 minutes, 30.413696 seconds
MemoryUtilization
Checksum
GoodComments
BadComments
NeutralComments
26f0780 tests: add many unit tests for the memory pool test suite
4b20e77 core: improve function that detects memory leaks in memory pool
1fe70a3 tests: enable unit test for memory pool
c6b3424 core: fix function that checks memory leaks in memory pool
8 hour job:
Submitted build spate-2014-11-07-20-43-15 (26f0780ce81f9ad448a9e736301784bc828596b5)
Beagle) qsub spate-2014-11-07-20-43-15.pbs
2908905.sdb
Beagle) grep ,walltime spate-2014-11-07-20-43-15.o2908905
Beagle) grep TIMER spate-2014-11-07-20-43-15.stdout
TIMER [Load input / Count input data] 4 minutes, 1.845490 seconds
TIMER [Load input / Distribute input data] 4 minutes, 49.082275 seconds
TIMER [Load input] 8 minutes, 50.927734 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutescore_manager/1021181 dies
TIMER [Build assembly graph / Distribute arcs] 9 minutes, 55.694824 seconds
TIMER [Build assembly graph] 15 minutes, 21.868958 seconds
DEBUG the system has 2252800 visitors
calculation:
irb(main):003:0> 6144/24.0*22*400
=> 2252800.0

thorium_worker_pool: node/130 EPOCH LOAD 1470 s 2.49/22 (0.11) 0.13 0.13 0.00 0.12 0.12 0.13 0.12 0.12 0.05 0.04 0.14 0.12 0.13 0.12 0.13 0.13 0.13 0.12 0.13 0.13 0.13 0.13
biosal_unitig_visitor/3084674 is ready to visit places in the universe
thorium_worker_pool: node/130 EPOCH FUTURE_TIMELINE 1470 s 110 105 114 115 123 185 183 214 195 204 226 212 222 175 207 212 223 195 185 185 217 221
at "1460 s", graph is ready.
job ends at "28817 s"
Submitted build spate-2014-11-09-10-45-41 (e2513e10d0d83bccb41c21ca04a8c43d6295ac1a) 2909970.sdb
Each visitor must visit around 25000 vertices:
irb(main):002:0> (141189180698 - 111416442656)/2/256/22/100
=> 26431

Beagle) grep TIMER spate-2014-11-09-10-45-41.stdout
TIMER [Load input / Count input data] 3 minutes, 30.649719 seconds
TIMER [Load input / Distribute input data] 3 minutes, 27.234467 seconds
TIMER [Load input] 6 minutes, 57.884186 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 40.622131 seconds
TIMER [Build assembly graph / Distribute arcs] 10 minutes, 4.028320 seconds
TIMER [Build assembly graph] 15 minutes, 44.650452 seconds

Beagle) tail spate-2014-11-09-10-45-41.stdout
biosal_unitig_visitor/1467767 visited 7500 vertices so far (velocity: 3.366248 vertices / s)
biosal_unitig_visitor/1240886 visited 10500 vertices so far (velocity: 4.706409 vertices / s)
biosal_unitig_visitor/1445941 visited 10500 vertices so far (velocity: 4.712747 vertices / s)
biosal_unitig_visitor/1554232 visited 7500 vertices so far (velocity: 3.364738 vertices / s)
biosal_unitig_visitor/1327208 visited 9000 vertices so far (velocity: 4.037685 vertices / s)
biosal_unitig_visitor/1365092 visited 8000 vertices so far (velocity: 3.587444 vertices / s)
biosal_unitig_visitor/1321576 visited 7500 vertices so far (velocity: 3.364738 vertices / s)
biosal_unitig_visitor/1148652 visited 6500 vertices so far (velocity: 2.914798 vertices / s)
biosal_unitig_visitor/1543035 visited 10500 vertices so far (velocity: 4.706409 vertices / s)
biosal_unitig_visitor/1563851 visited 9500 vertices so far (velocity: 4.263914 vertices / s)
velocity is not that great...
DEBUG the system has 563200 visitors
JobName
Goal 4-hour job: it should visit all the vertices in < 2 hours
Machine Beagle
AllocationStatus
Path /lustre/beagle2/CompBIO/automated-tests
Commit e2513e1
Toolchain
Script
Beagle) cat spate-2014-11-09-10-45-41-2.pbs
cd $PBS_O_WORKDIR
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
export CRAY_MALLOPT_OFF=1
echo "Commit= e2513e10d0d83bccb41c21ca04a8c43d6295ac1a"
aprun -n 256 -N 1 -d 23 -r 1 \
 spate-2014-11-09-10-45-41-2.spate -threads-per-node 23 -print-load \
 -k 43 Iowa_Continuous_Corn/*.fastq -o spate-2014-11-09-10-45-41-2 > spate-2014-11-09-10-45-41-2.stdout
Submission
Beagle) qsub spate-2014-11-09-10-45-41-2.pbs
2910029.sdb
MachineUtilization
ComputationLoad
RunningTime
Beagle) grep TIMER spate-2014-11-09-10-45-41-2.stdout
TIMER [Load input / Count input data] 3 minutes, 30.913208 seconds
TIMER [Load input / Distribute input data] 3 minutes, 29.836578 seconds
TIMER [Load input] 7 minutes, 0.749786 seconds
DEBUG_POOL Name= 0xa156d064 Self=0x2aab8bd1b7a8 Result= - ATIMER [Build assembly graph / Distribute vertices] 5 minutes, 36.316711 seconds
TIMER [Build assembly graph / Distribute arcs] 9 minutes, 53.918091 seconds
TIMER [Build assembly graph] 15 minutes, 30.234802 seconds
TIMER [Visit vertices for unitigs] 61 minutes, 46.888428 seconds
TIMER [Walk for unitigs] 23 minutes, 52.694702 seconds
TIMER [Total] 108 minutes, 31.804199 seconds
MemoryUtilization
Checksum
GoodComments
BadComments
NeutralComments
Multiplexing rate
Beagle) tail spate-2014-11-09-10-45-41-2.stdout
thorium_message_multiplexer: original_message_count 65692338 real_message_count 39290436 (0.5981)
thorium_message_multiplexer: original_message_count 56643081 real_message_count 37397266 (0.6602)
thorium_message_multiplexer: original_message_count 60924411 real_message_count 37200669 (0.6106)
thorium_message_multiplexer: original_message_count 62630445 real_message_count 38497287 (0.6147)
thorium_message_multiplexer: original_message_count 63264512 real_message_count 37680505 (0.5956)
thorium_message_multiplexer: original_message_count 62402649 real_message_count 38596478 (0.6185)
thorium_message_multiplexer: original_message_count 63499178 real_message_count 38680350 (0.6091)
thorium_message_multiplexer: original_message_count 64174211 real_message_count 39051161 (0.6085)
thorium_message_multiplexer: original_message_count 61591127 real_message_count 36873615 (0.5987)
Application 5009918 resources: utime ~29758391s, stime ~8402871s, Rss ~24108432, inblocks ~2003228137, outblocks ~16703651
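The figure in parentheses on each multiplexer line matches real_message_count divided by original_message_count, so a value near 0.6 means roughly 40% fewer messages reached the transport. A minimal sketch that recomputes it from a log line follows; `multiplexing_ratio` is a hypothetical helper, and reading the ratio as "fraction of messages actually sent" is an interpretation of the counter names rather than something the ticket states.

```python
import re

# Matches the thorium_message_multiplexer summary lines quoted in the ticket.
MUX_RE = re.compile(r"original_message_count (\d+) real_message_count (\d+)")


def multiplexing_ratio(line):
    """Return real_message_count / original_message_count, or None."""
    match = MUX_RE.search(line)
    if match is None:
        return None
    original, real = map(int, match.groups())
    return real / original


line = ("thorium_message_multiplexer: original_message_count 65692338 "
        "real_message_count 39290436 (0.5981)")
print(f"{multiplexing_ratio(line):.4f}")
```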
There was one multiplexer subsystem per node, not per worker.
29772738042 vertices with a coverage of at least 2

biosal_unitig_visitor/1473433 visited 4500 vertices so far (velocity: 9.740260 vertices / s)
biosal_unitig_visitor/1473433 visited 5000 vertices so far (velocity: 9.842520 vertices / s)
biosal_unitig_visitor/1473433 visited 5500 vertices so far (velocity: 9.874327 vertices / s)
biosal_unitig_visitor/1473433 visited 6000 vertices so far (velocity: 9.771987 vertices / s)
biosal_unitig_visitor/1473433 visited 6500 vertices so far (velocity: 9.643917 vertices / s)
biosal_unitig_visitor/1473433 visited 7000 vertices so far (velocity: 9.735744 vertices / s)
biosal_unitig_visitor/1473433 visited 7500 vertices so far (velocity: 9.791122 vertices / s)
biosal_unitig_visitor/1473433 visited 8000 vertices so far (velocity: 9.828010 vertices / s)

irb(main):001:0> 141189180698 - 111416442656
=> 29772738042
predicted:
irb(main):005:0> 29772738042.0 / 2 / 256 / 22 / 100 / 10 / 60
=> 44.0529386274858
actual time for visitors: 61 minutes
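The gap between the 44-minute prediction and the measured 61 minutes can be restated as an effective per-visitor velocity. The sketch below redoes the irb arithmetic and inverts it. The divisor of 2 and the assumed 10 vertices / s are taken from the ticket's formula as-is, and the helper names are hypothetical.

```python
def vertices_per_visitor(vertices, nodes, workers_per_node, visitors_per_worker):
    """Share of the graph each visitor must cover (formula from the ticket)."""
    # The division by 2 is copied from the ticket; its rationale is not stated there.
    return vertices / 2 / nodes / workers_per_node / visitors_per_worker


def predicted_minutes(per_visitor, velocity):
    """Traversal time if every visitor sustains `velocity` vertices / s."""
    return per_visitor / velocity / 60


per_visitor = vertices_per_visitor(29772738042.0, 256, 22, 100)
print(predicted_minutes(per_visitor, 10))  # prediction at 10 vertices / s

# Measured: TIMER [Visit vertices for unitigs] 61 minutes, 46.888428 seconds
actual_seconds = 61 * 60 + 46.888428
print(per_visitor / actual_seconds)  # effective vertices / s per visitor
```

On these numbers the visitors averaged a little over 7 vertices / s across the whole phase, below the roughly 9.7 to 9.9 vertices / s seen in the log sample above.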
This ticket is done, graph traversal was performed on Beagle.
Now the result must be improved.
TODO: fill QA report completely (velocity, data size, graph size, memory usage, LOAD, basically every part of the QA report)
update: wait for Beagle. It will come back from maintenance on December 1st, 2014.
Curious what the scaling results would be for this one. What's the smallest number of Beagle nodes that would have enough memory to run this job?
I would say around 140.
Memory usage sits at around 22-25 GiB per node, but about 3.5 GiB is already in use before the job starts.
Is the iowa-cc dataset around 250GB? That means the minimum system memory to input data size ratio is 140*32/250 = 18?
It is 450 GiB on disk, uncompressed. Once I fill the report, we will be able to do these estimates (Beagle is not online right now).
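The ratio asked about above is a one-line calculation, and it changes a lot with the dataset size used. The sketch below evaluates it for the 250 GB guess from the question and for the 450 GiB on-disk size given in the reply. The 140-node minimum is an estimate from this thread, 32 is the per-node memory figure used in the question, and the two sizes are in slightly different units (GB vs GiB), so the results are rough.

```python
def memory_to_input_ratio(nodes, memory_per_node, input_size):
    """Aggregate system memory divided by input data size (same units assumed)."""
    return nodes * memory_per_node / input_size


# Figures quoted in this thread; the 140-node minimum is itself an estimate.
print(memory_to_input_ratio(140, 32, 250))  # with the 250 GB guess
print(memory_to_input_ratio(140, 32, 450))  # with the 450 GiB on-disk size
```

With the larger on-disk size the ratio is closer to 10 than to 18, pending the measurements promised for the QA report.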
Beagle will come back online on December 1st, 2014.
Submitted build spate-2014-11-05-17-12-23 (07d6f14edf312e8ca6f371ae4a8e5172c758aac6) 2905036.sdb
Beagle) grep TIMER spate-2014-11-05-17-12-23.stdout
TIMER [Load input / Count input data] 3 minutes, 41.167130 seconds
TIMER [Load input / Distribute input data] 3 minutes, 34.626663 seconds
TIMER [Load input] 7 minutes, 15.793793 seconds
TIMER [Build assembly graph / Distribute vertices] 14 minutes, 10.378845 seconds

Beagle) tail -n1 spate-2014-11-05-17-12-23.e2905036
[NID 00713] 2014-11-05 17:38:33 Apid 4997901: OOM killer terminated this process.