GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License

run latency_probe on mira #814

Closed sebhtml closed 10 years ago

sebhtml commented 10 years ago

[boisvert@miralac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-latency_probe.sh
Submitted build latency_probe-2014-11-12-03-57-16 (211d4e1893bc70617764ba534068e3d2f196dbdd) 362581

[boisvert@miralac1 automated-tests]$ grep COUNTER latency_probe-2014-11-12-03-57-16.output
PERFORMANCE_COUNTER type = ping-pong
PERFORMANCE_COUNTER ping-action = ACTION_PING
PERFORMANCE_COUNTER pong-action = ACTION_PING_REPLY
PERFORMANCE_COUNTER node-count = 1024
PERFORMANCE_COUNTER worker-count-per-node = 15
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 15360
PERFORMANCE_COUNTER actor-count = 1536000
PERFORMANCE_COUNTER ping-message-count-per-actor = 40000
PERFORMANCE_COUNTER ping-message-count = 61440000000
PERFORMANCE_COUNTER pong-message-count = 61440000000
PERFORMANCE_COUNTER message-count = 122880000000
PERFORMANCE_COUNTER elapsed-time = 1211.814314 s
PERFORMANCE_COUNTER computation-throughput = 101401673.984160 messages / s
PERFORMANCE_COUNTER node-throughput = 99025.072250 messages / s
PERFORMANCE_COUNTER worker-throughput = 6601.671483 messages / s
PERFORMANCE_COUNTER worker-latency = 151476 ns
PERFORMANCE_COUNTER actor-throughput = 66.016715 messages / s
PERFORMANCE_COUNTER actor-latency = 15147678 ns
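The derived counters in this output appear to follow directly from the configuration and the elapsed time. The sketch below recomputes them in Python. The helper name `derive_counters` and the formulas (one pong per ping, latency as the reciprocal of throughput truncated to whole nanoseconds) are assumptions inferred from the printed numbers, not taken from the latency_probe source.

```python
def derive_counters(node_count, workers_per_node, actors_per_worker,
                    pings_per_actor, elapsed_s):
    """Recompute derived PERFORMANCE_COUNTER values (hypothetical helper)."""
    worker_count = node_count * workers_per_node
    actor_count = worker_count * actors_per_worker
    ping_count = actor_count * pings_per_actor
    message_count = 2 * ping_count  # assumption: every ping gets one pong

    throughput = message_count / elapsed_s
    worker_throughput = throughput / worker_count
    actor_throughput = throughput / actor_count

    return {
        "worker-count": worker_count,
        "actor-count": actor_count,
        "message-count": message_count,
        "computation-throughput": throughput,
        "node-throughput": throughput / node_count,
        "worker-throughput": worker_throughput,
        # assumption: latency is 1 / throughput, truncated to integer ns
        "worker-latency-ns": int(1e9 / worker_throughput),
        "actor-throughput": actor_throughput,
        "actor-latency-ns": int(1e9 / actor_throughput),
    }


counters = derive_counters(1024, 15, 100, 40000, 1211.814314)
for name, value in counters.items():
    print(f"{name} = {value}")
```

Under these assumptions the recomputed values agree with the logged ones, which suggests the latency counters are averages derived from the total elapsed time rather than per-message measurements.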

The previous result was 40.2 M messages / s: https://github.com/GeneAssembly/biosal/issues/786#issuecomment-62167011

Mira (ALCF), 1024 nodes x 16 threads

Message batching

[boisvert@miralac1 automated-tests]$ tail latency_probe-2014-11-12-03-57-16.output
thorium_message_multiplexer: original_message_count 7968063 real_message_count 2849081 (0.3576)
thorium_message_multiplexer: original_message_count 7969191 real_message_count 2832148 (0.3554)
thorium_message_multiplexer: original_message_count 7966675 real_message_count 2819529 (0.3539)
thorium_message_multiplexer: original_message_count 7962088 real_message_count 2834350 (0.3560)
thorium_message_multiplexer: original_message_count 7968680 real_message_count 2846612 (0.3572)
thorium_message_multiplexer: original_message_count 7962284 real_message_count 2839215 (0.3566)
thorium_message_multiplexer: original_message_count 7969849 real_message_count 2846057 (0.3571)
thorium_message_multiplexer: original_message_count 7970920 real_message_count 2842008 (0.3565)
thorium_message_multiplexer: original_message_count 7967976 real_message_count 2850384 (0.3577)
thorium_message_multiplexer: original_message_count 7964368 real_message_count 2846588 (0.3574)
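The number in parentheses on each multiplexer line matches real_message_count divided by original_message_count. A minimal Python sketch for extracting that ratio from a log follows; the helper name `batching_ratios` is hypothetical, and the regular expression only assumes the line format shown above.

```python
import re

# Matches the two counters on a thorium_message_multiplexer summary line.
LINE_RE = re.compile(
    r"thorium_message_multiplexer: original_message_count (\d+) "
    r"real_message_count (\d+)"
)


def batching_ratios(log_text):
    """Return real/original message ratios, one per multiplexer line."""
    return [int(real) / int(original)
            for original, real in LINE_RE.findall(log_text)]


sample = """\
thorium_message_multiplexer: original_message_count 7968063 real_message_count 2849081 (0.3576)
thorium_message_multiplexer: original_message_count 7966675 real_message_count 2819529 (0.3539)
"""

ratios = batching_ratios(sample)
for ratio in ratios:
    print(f"{ratio:.4f}")
```

A lower ratio means more original messages were packed into each real message.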

Theory

A node has 1500 sources (15 * 100) and 15 targets (15 * 1). There are 1023 destinations. A promotion rate (real messages / original messages) of 35% looks high, but it is quite good: for pings alone, the best possible ratio is 1024 / 1500.0 = 0.6827. If you include the replies, it is 1024 / (1500.0 * 2) = 0.3413. The observed 35% is only slightly above that theoretical best case.
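The arithmetic above can be reproduced with a short Python sketch. The model behind `best_case_ratio` (a hypothetical name) is one reading of the argument: in each round every source emits a fixed number of messages, and the multiplexer can at best send one real message per destination node.

```python
workers_per_node = 15
actors_per_worker = 100
node_count = 1024  # the calculation above uses 1024 rather than 1023

sources_per_node = workers_per_node * actors_per_worker  # 1500


def best_case_ratio(destinations, sources, messages_per_source=1):
    """Lowest real/original ratio under the assumed batching model."""
    return destinations / (sources * messages_per_source)


pings_only = best_case_ratio(node_count, sources_per_node)
with_replies = best_case_ratio(node_count, sources_per_node, 2)

# Observed ratio taken from the first multiplexer line in the log.
observed = 2849081 / 7968063

print(f"pings only:   {pings_only:.4f}")
print(f"with replies: {with_replies:.4f}")
print(f"observed:     {observed:.4f}")
print(f"gap to bound: {observed - with_replies:.4f}")
```

Under this model the observed ratio sits within about two percentage points of the bound.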

Memory usage

thorium_node: node/894 METRICS AliveActorCount: 1516 ByteCount: 1000345600 / 17163091968
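To put the ByteCount figures in perspective, the Python sketch below converts them to a fraction. It assumes that the two numbers are used bytes and total available bytes on the node; the METRICS line does not state this explicitly, and `memory_fraction` is a hypothetical helper.

```python
import re


def memory_fraction(metrics_line):
    """Parse 'ByteCount: used / total' and return (used, total, used/total)."""
    match = re.search(r"ByteCount: (\d+) / (\d+)", metrics_line)
    if match is None:
        raise ValueError("no ByteCount field found")
    used, total = (int(group) for group in match.groups())
    return used, total, used / total


line = ("thorium_node: node/894 METRICS AliveActorCount: 1516 "
        "ByteCount: 1000345600 / 17163091968")

used, total, fraction = memory_fraction(line)
print(f"{used / 2**20:.0f} MiB of {total / 2**30:.2f} GiB ({fraction:.1%})")
```

Under that assumption, node 894 uses a little under 6% of its memory for roughly 1500 actors.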

sebhtml commented 10 years ago

Goal

Try with the adaptive multiplexer and with 48 threads per node

[boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-latency_probe.sh \
Project 'compbio'; job rerouted to queue 'prod-short'
Submitted build latency_probe-2014-11-13-22-12-46 (509472f94bdbb14b13346ee5d3a9573763a2f5f3) 364015

Result: it does not look very balanced

sebhtml commented 10 years ago

Goal

Test the new manual multiplexer with 2 threads / core

[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-latency_probe.sh

Submitted build latency_probe-2014-11-14-02-41-55 (b0ecb745f90d7261a116a6175c7e57beb4b4b680) 364198

Result: the job did not finish within 1 h

sebhtml commented 10 years ago

Goal

Run latency_probe with 16 threads / node

Submitted build latency_probe-2014-11-14-15-28-44 (5c9f3548a2ed15f040ac305b71fadaaf6325b0fb) 364494

thorium_message_multiplexer: original_message_count 7968152 real_message_count 7952528 (0.9980)
thorium_message_multiplexer: original_message_count 7967934 real_message_count 7952160 (0.9980)
thorium_message_multiplexer: original_message_count 7974205 real_message_count 7958714 (0.9981)
thorium_message_multiplexer: original_message_count 7970399 real_message_count 7955233 (0.9981)
thorium_message_multiplexer: original_message_count 7964011 real_message_count 7948475 (0.9980)
thorium_message_multiplexer: original_message_count 7967196 real_message_count 7951732 (0.9981)
thorium_message_multiplexer: original_message_count 7970107 real_message_count 7954916 (0.9981)

[boisvert@cetuslac1 automated-tests]$ grep COUNTER latency_probe-2014-11-14-15-28-44.output
PERFORMANCE_COUNTER type = ping-pong
PERFORMANCE_COUNTER ping-action = ACTION_PING
PERFORMANCE_COUNTER pong-action = ACTION_PING_REPLY
PERFORMANCE_COUNTER node-count = 1024
PERFORMANCE_COUNTER worker-count-per-node = 15
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 15360
PERFORMANCE_COUNTER actor-count = 1536000
PERFORMANCE_COUNTER ping-message-count-per-actor = 40000
PERFORMANCE_COUNTER ping-message-count = 61440000000
PERFORMANCE_COUNTER pong-message-count = 61440000000
PERFORMANCE_COUNTER message-count = 122880000000
PERFORMANCE_COUNTER elapsed-time = 3121.903942 s
PERFORMANCE_COUNTER computation-throughput = 39360596.056063 messages / s
PERFORMANCE_COUNTER node-throughput = 38438.082086 messages / s
PERFORMANCE_COUNTER worker-throughput = 2562.538806 messages / s
PERFORMANCE_COUNTER worker-latency = 390237 ns
PERFORMANCE_COUNTER actor-throughput = 25.625388 messages / s
PERFORMANCE_COUNTER actor-latency = 39023799 ns

sebhtml commented 10 years ago

Goal

Test with constant timeout and 32 threads / node

[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-latency_probe.sh

Submitted build latency_probe-2014-11-14-16-51-52 (5cc9354ca264811d4f4e8f560d143c9b61e443bc) 364565

"too many reboot attempts"

Second attempt:

[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-latency_probe.sh
Submitted build latency_probe-2014-11-14-23-35-56 (558c83b2ccf26061031a66a9dafeb41492712931) 364885

=> Did not finish. It seems that hardware threading on the A2 cores mostly pays off for floating-point work. Let's switch back to 16 threads / node.

[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-latency_probe.sh
Submitted build latency_probe-2014-11-15-02-24-24 (261f12bb68dfff62b1ea59f8322176f0ced43cbc) 364961

[boisvert@cetuslac1 automated-tests]$ grep COUNTER latency_probe-2014-11-15-02-24-24.output
PERFORMANCE_COUNTER type = ping-pong
PERFORMANCE_COUNTER ping-action = ACTION_PING
PERFORMANCE_COUNTER pong-action = ACTION_PING_REPLY
PERFORMANCE_COUNTER node-count = 1024
PERFORMANCE_COUNTER worker-count-per-node = 15
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 15360
PERFORMANCE_COUNTER actor-count = 1536000
PERFORMANCE_COUNTER ping-message-count-per-actor = 40000
PERFORMANCE_COUNTER ping-message-count = 61440000000
PERFORMANCE_COUNTER pong-message-count = 61440000000
PERFORMANCE_COUNTER message-count = 122880000000
PERFORMANCE_COUNTER elapsed-time = 1125.842059 s
PERFORMANCE_COUNTER computation-throughput = 109144971.946412 messages / s
PERFORMANCE_COUNTER node-throughput = 106586.886666 messages / s
PERFORMANCE_COUNTER worker-throughput = 7105.792444 messages / s
PERFORMANCE_COUNTER worker-latency = 140730 ns
PERFORMANCE_COUNTER actor-throughput = 71.057924 messages / s
PERFORMANCE_COUNTER actor-latency = 14073025 ns
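Since the three completed runs in this thread exchange the same number of messages, their throughputs can be compared from the elapsed times alone. The Python sketch below does that; the dictionary layout is only an illustration, and the elapsed times are copied from the PERFORMANCE_COUNTER output of each build.

```python
MESSAGE_COUNT = 122_880_000_000  # identical in all three runs

# Elapsed times in seconds, keyed by build name.
runs = {
    "latency_probe-2014-11-12-03-57-16": 1211.814314,
    "latency_probe-2014-11-14-15-28-44": 3121.903942,
    "latency_probe-2014-11-15-02-24-24": 1125.842059,
}

throughputs = {build: MESSAGE_COUNT / elapsed
               for build, elapsed in runs.items()}
best = max(throughputs, key=throughputs.get)

for build, throughput in throughputs.items():
    relative = throughput / throughputs[best]
    print(f"{build}: {throughput / 1e6:.1f} M messages / s "
          f"({relative:.2f} of best)")
```

The run whose multiplexer ratio was close to 1.0 (almost no batching) is also the slowest of the three, which is consistent with batching being the main factor here, although the commits differ between runs.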