GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License

Run Spate on Iowa Native Prairie on Edison #830

Open sebhtml opened 10 years ago

sebhtml commented 10 years ago

data specs:

sebhtml commented 10 years ago

boisvert@edison12:/scratch2/scratchdirs/boisvert/Data> lfs setstripe -c 100 .
boisvert@edison12:/scratch2/scratchdirs/boisvert/Data> pwd
/scratch2/scratchdirs/boisvert/Data

sebhtml commented 10 years ago

[boisvert@cetuslac1 Iowa_Native_Prairie_Soil]$ pwd
/gpfs/mira-fs1/projects/CompBIO/Datasets/JGI/Great_Prairie_Soil_Metagenome_Grand_Challenge/Datasets/Iowa_Native_Prairie_Soil
[boisvert@cetuslac1 Iowa_Native_Prairie_Soil]$ du -sh .
709G .

sebhtml commented 10 years ago

Transfer started.

boisvert@edison12:/project/projectdirs/m1523/Data> pwd
/project/projectdirs/m1523/Data
boisvert@edison12:/project/projectdirs/m1523/Data> ls
Iowa_Continuous_Corn Iowa_Native_Prairie_Soil
boisvert@edison12:/project/projectdirs/m1523/Data> ls -l
total 256
drwxr-s--- 2 boisvert m1523 131072 21 nov 04:15 Iowa_Continuous_Corn
lrwxrwxrwx 1 boisvert m1523 60 27 nov 09:40 Iowa_Native_Prairie_Soil -> /scratch2/scratchdirs/boisvert/Data/Iowa_Native_Prairie_Soil

sebhtml commented 10 years ago

JobName

Goal first assembly on Iowa Prairie

Machine Edison

AllocationStatus
boisvert@edison12:/project/projectdirs/m1523/Jobs> getnim -U$(whoami)
m1523 750595.42 ACTV

Path /project/projectdirs/m1523/Jobs

Commit e3a40915709

86ca3ed049e

Toolchain Intel

Script
boisvert@edison12:/project/projectdirs/m1523/Jobs> cat spate-iowa-prairie-edison-256x24-2014-11-28-1.pbs
#!/bin/bash
#PBS -N spate-iowa-prairie-edison-256x24-2014-11-28-1
#PBS -A m1523
#PBS -l walltime=4:00:00
#PBS -l mppwidth=6144
#PBS -q regular

cd $PBS_O_WORKDIR

export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple

aprun -n 256 -N 1 -d 23 -r 1 \
    ./spate-iowa-prairie-edison-256x24-2014-11-28-1.spate -threads-per-node 23 -print-load \
    -k 33 Iowa_Native_Prairie_Soil/*.fastq -o spate-iowa-prairie-edison-256x24-2014-11-28-1 \
    > spate-iowa-prairie-edison-256x24-2014-11-28-1.stdout

Submission
boisvert@edison12:/project/projectdirs/m1523/Jobs> qsub spate-iowa-prairie-edison-256x24-2014-11-28-1.pbs
2112142.edique02

MachineUtilization

ComputationLoad

RunningTime
boisvert@edison12:/project/projectdirs/m1523/Jobs> grep TIMER spate-iowa-prairie-edison-256x24-2014-11-28-1.stdout
TIMER [Load input / Count input data] 1 minutes, 1.038311 seconds
TIMER [Load input / Distribute input data] 57.715984 seconds
TIMER [Load input] 1 minutes, 58.754295 seconds
TIMER [Build assembly graph / Distribute vertices] 3 minutes, 7.677612 secondscore_manager/1022461 dies
TIMER [Build assembly graph / Distribute arcs] 6 minutes, 36.023254 seconds
TIMER [Build assembly graph] 9 minutes, 43.700867 seconds

MemoryUtilization
thorium_node: node/124 METRICS AliveActorCount: 2597 ByteCount: 19512586240 / 67657900032
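To turn a METRICS line into a memory utilization figure, a small helper can pull the two ByteCount numbers out and divide them. This is only a sketch: it assumes the first number is bytes in use and the second is bytes available on the node, which the log does not state explicitly. The function name `memory_fraction` is made up for this example.

```python
import re

# Matches the "ByteCount: <first> / <second>" pair printed in thorium METRICS lines.
BYTECOUNT = re.compile(r"ByteCount:\s*(\d+)\s*/\s*(\d+)")


def memory_fraction(metrics_line):
    """Return first/second ByteCount value as a float, or None if absent.

    Assumption: the first value is bytes in use, the second is the node total.
    """
    match = BYTECOUNT.search(metrics_line)
    if match is None:
        return None
    used, total = (int(group) for group in match.groups())
    return used / total


# Line copied from the MemoryUtilization field above.
line = ("thorium_node: node/124 METRICS AliveActorCount: 2597 "
        "ByteCount: 19512586240 / 67657900032")
fraction = memory_fraction(line)
print(f"node/124 uses {fraction:.1%} of the reported total")
```

Under that reading, the node is using a bit under a third of its memory after the graph is built.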

Checksum

GoodComments

BadComments

NeutralComments

sebhtml commented 9 years ago

JobName

Goal do a job with a lot of runtime information

Machine

AllocationStatus

Path

Commit 0dfbbebe0d41df6ca8e398843f74fc002bdeb099

Toolchain Intel

Script

Submission
./tests/Edison_Cray_XC30/launch-Spate-Iowa-Native-Prairie-Soil.sh
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-01-18-10-59 (0dfbbebe0d41df6ca8e398843f74fc002bdeb099)
2119120.edique02

MachineUtilization

ComputationLoad

RunningTime
aprun: file spate-Iowa_Native_Prairie_Soil-2014-12-01-18-10-59.spate not found

MemoryUtilization

Checksum

GoodComments

BadComments

NeutralComments

sebhtml commented 9 years ago

JobName

Goal do a job with a lot of runtime information

Machine Edison

AllocationStatus m1523 565032.27 ACTV

Path

Commit df3055b91857069d26425586c269f57117c6177a

Toolchain Intel

Script
boisvert@edison12:/project/projectdirs/m1523/Jobs> cat spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58.pbs

#!/bin/bash
#PBS -N spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58
#PBS -A m1523
#PBS -l walltime=3:00:00
#PBS -l mppwidth=6144
#PBS -q regular

# 256 * 24 = 6144

cd $PBS_O_WORKDIR
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
export CRAY_MALLOPT_OFF=1

echo "Commit= df3055b91857069d26425586c269f57117c6177a"

aprun -n 256 -N 1 -d 23 -r 1 \
    ./spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58.spate -threads-per-node 23 -print-thorium-data \
    -k 27 Iowa_Native_Prairie_Soil/*.fastq -o spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58 > spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58.stdout
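The job geometry in the script can be sanity-checked with a few lines of arithmetic. The sketch below only restates numbers that appear in the script. The one outside assumption is that `aprun -r 1` sets aside one core per node for core specialization, so that 23 threads plus 1 reserved core fill a 24-core node.

```python
# Values copied from the PBS script above.
nodes = 256            # aprun -n 256 with -N 1: one process per node
cores_per_node = 24    # from the "256 * 24 = 6144" note in the script
threads_per_node = 23  # aprun -d 23 and spate -threads-per-node 23
reserved_cores = 1     # aprun -r 1 (assumed: one core kept for system services)

mppwidth = nodes * cores_per_node
cores_accounted_for = threads_per_node + reserved_cores

print(f"mppwidth = {mppwidth}")
print(f"cores accounted for per node = {cores_accounted_for} of {cores_per_node}")
```

The EPOCH LOAD lines in this thread report load out of 22, which suggests that one of the 23 threads is not a worker. That reading is a guess from the logs, not something the thread confirms.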

Submission
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58 (df3055b91857069d26425586c269f57117c6177a)
2121622.edique02

MachineUtilization

ComputationLoad
boisvert@edison12:/project/projectdirs/m1523/Jobs> grep LOAD spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58.stdout |tail -n4
[thorium] node 38 EPOCH LOAD 10820 s 0.72/22 (0.03) 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.05 0.03 0.03
[thorium] node 87 EPOCH LOAD 10820 s 0.83/22 (0.04) 0.04 0.03 0.04 0.04 0.04 0.03 0.04 0.04 0.04 0.03 0.03 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
[thorium] node 211 EPOCH LOAD 10820 s 0.84/22 (0.04) 0.04 0.04 0.04 0.04 0.04 0.04 0.03 0.04 0.03 0.04 0.03 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
[thorium] node 16 EPOCH LOAD 10820 s 0.86/22 (0.04) 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04

RunningTime
boisvert@edison12:/project/projectdirs/m1523/Jobs> grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58.stdout
TIMER [Load input / Count input data] 1 minutes, 2.942043 seconds
TIMER [Load input / Distribute input data] 1 minutes, 2.831348 seconds
TIMER [Load input] 2 minutes, 5.773392 seconds
TIMER [Build assembly graph / Distribute vertices] 3 minutes, 15.937500 seconds
TIMER [Build assembly graph / Distribute arcs] 6 minutes, 32.063751 seconds
TIMER [Build assembly graph] 9 minutes, 48.001221 seconds

MemoryUtilization
[thorium] node 38 METRICS AliveActorCount: 2597 ByteCount: 17485611008 / 67657900032
[thorium] node 87 METRICS AliveActorCount: 2597 ByteCount: 17387225088 / 67657900032
[thorium] node 211 METRICS AliveActorCount: 2597 ByteCount: 17347649536 / 67657900032
[thorium] node 16 METRICS AliveActorCount: 2597 ByteCount: 17496276992 / 67657900032

Checksum

GoodComments
biosal_unitig_visitor/1039353 visited 73000 vertices so far. velocity: 13.513514 vertices / s, 117.648651 received messages / s, 117.648651 sent messages / s

[thorium] node 16 EPOCH TRAFFIC_REDUCTION 10820 s 0.10 0.12 0.11 0.11 0.10 0.10 0.11 0.10 0.10 0.12 0.11 0.10 0.11 0.11 0.10 0.10 0.10 0.11 0.11 0.11 0.10 0.09

boisvert@edison12:/project/projectdirs/m1523/Jobs> grep GRAPH spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58.stdout
GRAPH -> 176960732906 vertices, 458929863198 vertex observations, and 176100436779 arcs.
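Two ratios help put the GRAPH line in perspective. The sketch below computes observations per vertex and arcs per vertex from the three reported counts. Calling the first ratio "mean multiplicity" is an interpretation: it assumes a vertex observation is one occurrence of a k-mer in the input.

```python
# Counts copied from the GRAPH line above.
vertices = 176_960_732_906
vertex_observations = 458_929_863_198
arcs = 176_100_436_779

# Assumed meaning: average number of times each distinct vertex was seen.
mean_observations_per_vertex = vertex_observations / vertices
arcs_per_vertex = arcs / vertices

print(f"observations per vertex: {mean_observations_per_vertex:.2f}")
print(f"arcs per vertex:         {arcs_per_vertex:.3f}")
```

A mean of roughly 2.6 observations per vertex is low, which would be in line with a soil metagenome where most k-mers are seen only a handful of times.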

boisvert@edison12:/project/projectdirs/m1523/Jobs> grep visitors spate-Iowa_Native_Prairie_Soil-2014-12-02-08-12-58.stdout
DEBUG the system has 653312 visitors

spate/1000960 has 5632 graph stores

BadComments not enough traffic reduction

NeutralComments

sebhtml commented 9 years ago

JobName

Goal Try the new fancy dynamic timeout

Machine Edison

AllocationStatus

Path

Commit 27121c73fc7292ab9ed4d3ec24ccee547972f12f

Toolchain

Script

Submission
./tests/Edison_Cray_XC30/launch-Spate-Iowa-Native-Prairie-Soil.sh
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-03-15-14-07 (27121c73fc7292ab9ed4d3ec24ccee547972f12f)
2130439.edique02

MachineUtilization

ComputationLoad

RunningTime
boisvert@edison12:/project/projectdirs/m1523/Jobs> grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-03-15-14-07.stdout
TIMER [Load input / Count input data] 1 minutes, 2.443436 seconds
TIMER [Load input / Distribute input data] 1 minutes, 1.154144 seconds
TIMER [Load input] 2 minutes, 3.597580 seconds
TIMER [Build assembly graph / Distribute vertices] 3 minutes, 17.623886 seconds
TIMER [Build assembly graph / Distribute arcs] 6 minutes, 33.651123 seconds
TIMER [Build assembly graph] 9 minutes, 51.275024 seconds

MemoryUtilization

Checksum

GoodComments
biosal_unitig_visitor/1039632 visited 64500 vertices so far. velocity: 6.329114 vertices / s, 54.898735 received messages / s, 54.898735 sent messages / s

BadComments 2 classes of nodes:

[thorium] node 52 EPOCH LOAD 10810 s 0.98/22 (0.04) 0.04 0.04 0.04 0.05 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.06 0.04 0.04 0.04 0.04 0.04
[thorium] node 52 EPOCH FUTURE_TIMELINE 10810 s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[thorium] node 52 EPOCH WAKE_UP_COUNT 10810 s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[thorium] node 52 EPOCH TRAFFIC_REDUCTION 10810 s 0.53 0.55 0.54 0.52 0.52 0.54 0.53 0.53 0.52 0.53 0.52 0.51 0.52 0.51 0.50 0.52 0.53 0.52 0.50 0.51 0.52 0.53
[thorium] node 52 METRICS AliveActorCount: 2597 ByteCount: 17658482688 / 67657900032
[thorium] node 52 MESSAGE_TRANSPORT ReceivedMessageCount: 1728057738 SentMessageCount: 1755104177 InboundThroughput: 169701.600000 messages / s OutboundThroughput: 172617.000000 messages / s
[thorium] node 52 MESSAGE_QUEUES Tick: 1340462500 BufferedInboundMessageCount: 0 BufferedOutboundMessageCount: 1194 ActiveRequestCount: 22

[thorium] node 167 EPOCH LOAD 10800 s 3.57/22 (0.16) 0.17 0.16 0.14 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.16 0.15 0.15 0.14 0.16 0.16 0.17 0.17 0.16 0.16 0.17
[thorium] node 167 EPOCH FUTURE_TIMELINE 10800 s 14 26 13 25 18 15 29 16 24 22 16 24 25 15 0 19 20 23 18 19 18 19
[thorium] node 167 EPOCH WAKE_UP_COUNT 10800 s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[thorium] node 167 EPOCH TRAFFIC_REDUCTION 10800 s 0.32 0.33 0.32 0.32 0.32 0.31 0.31 0.32 0.32 0.33 0.33 0.31 0.32 0.31 0.30 0.31 0.31 0.31 0.30 0.31 0.30 0.33
[thorium] node 167 METRICS AliveActorCount: 2597 ByteCount: 16892043264 / 67657900032
[thorium] node 167 MESSAGE_TRANSPORT ReceivedMessageCount: 2250266081 SentMessageCount: 2291460108 InboundThroughput: 221841.800000 messages / s OutboundThroughput: 225543.000000 messages / s
[thorium] node 167 MESSAGE_QUEUES Tick: 2383521225 BufferedInboundMessageCount: 0 BufferedOutboundMessageCount: 619 ActiveRequestCount: 19

NeutralComments

sebhtml commented 9 years ago

Goal Try with more actors, bigger ideal buffer size, and lower congestion threshold

Script
boisvert@edison12:/project/projectdirs/m1523/Jobs> cat spate-Iowa_Native_Prairie_Soil-2014-12-04-13-03-20.pbs
#!/bin/bash
#PBS -N spate-Iowa_Native_Prairie_Soil-2014-12-04-13-03-20
#PBS -A m1523
#PBS -l walltime=3:00:00
#PBS -l mppwidth=6144
#PBS -q regular

# 256 * 24 = 6144

cd $PBS_O_WORKDIR
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
export CRAY_MALLOPT_OFF=1

echo "Commit= 0d480b7508a034e8480faef84ea7e27394dcb483"

aprun -n 256 -N 1 -d 23 -r 1 \
    ./spate-Iowa_Native_Prairie_Soil-2014-12-04-13-03-20.spate -threads-per-node 23 -print-thorium-data \
    -k 27 Iowa_Native_Prairie_Soil/*.fastq -o spate-Iowa_Native_Prairie_Soil-2014-12-04-13-03-20 > spate-Iowa_Native_Prairie_Soil-2014-12-04-13-03-20.stdout

Submission 256x24
./tests/Edison_Cray_XC30/launch-Spate-Iowa-Native-Prairie-Soil.sh
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-04-13-03-20 (0d480b7508a034e8480faef84ea7e27394dcb483)
2136938.edique02

Time
Doubling the buffer size increases latency because cores are occupied for too long.

boisvert@edison12:/project/projectdirs/m1523/Jobs> grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-04-13-03-20.stdout
TIMER [Load input / Count input data] 1 minutes, 3.237560 seconds
TIMER [Load input / Distribute input data] 57.761036 seconds
TIMER [Load input] 2 minutes, 0.998596 seconds
TIMER [Build assembly graph / Distribute vertices] 5 minutes, 19.892883 seconds
TIMER [Build assembly graph / Distribute arcs] 10 minutes, 22.068970 seconds
TIMER [Build assembly graph] 15 minutes, 41.961853 seconds
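To quantify the slowdown, the TIMER lines can be converted to seconds and compared with the previous run (2014-12-03). The parser below is a rough sketch written for this thread's output format. The names `TIMER` and `parse_timers` are invented here and are not part of biosal. Note that this run changed several settings at once, so the ratio cannot be attributed to the buffer size alone.

```python
import re

# "TIMER [name] M minutes, S seconds" or "TIMER [name] S seconds".
TIMER = re.compile(
    r"TIMER \[(?P<name>[^\]]+)\] "
    r"(?:(?P<minutes>\d+) minutes, )?(?P<seconds>[\d.]+) seconds"
)


def parse_timers(text):
    """Map each timer name to its duration in seconds."""
    timers = {}
    for match in TIMER.finditer(text):
        minutes = int(match.group("minutes") or 0)
        timers[match.group("name")] = 60 * minutes + float(match.group("seconds"))
    return timers


# Lines copied from the 2014-12-03 and 2014-12-04 runs in this thread.
previous = parse_timers(
    "TIMER [Load input] 2 minutes, 3.597580 seconds\n"
    "TIMER [Build assembly graph] 9 minutes, 51.275024 seconds\n"
)
current = parse_timers(
    "TIMER [Load input / Distribute input data] 57.761036 seconds\n"
    "TIMER [Load input] 2 minutes, 0.998596 seconds\n"
    "TIMER [Build assembly graph] 15 minutes, 41.961853 seconds\n"
)

ratio = current["Build assembly graph"] / previous["Build assembly graph"]
print(f"graph build time ratio (12-04 / 12-03): {ratio:.2f}")
```

Input loading is essentially unchanged between the two runs, while the graph build takes about 1.6 times as long.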

sebhtml commented 9 years ago

Goal Test better congestion detection and test message caching

Version 1d747471a765979d39cf119cef2a487e00c348df

Submission
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-06-20-09-20 (1d747471a765979d39cf119cef2a487e00c348df)
2148223.edique02

boisvert@edison12:/project/projectdirs/m1523/Jobs> grep visitors spate-Iowa_Native_Prairie_Soil-2014-12-06-20-09-20.stdout
DEBUG the system has 1306624 visitors
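The visitor count doubled relative to the 2014-12-02 run (653312 visitors). A quick check, sketched below, divides both counts by the 256 nodes and compares them with the per-node AliveActorCount values printed in the METRICS lines (2597 earlier, 5149 in this run). It assumes visitors are spread evenly over the nodes, which the logs suggest but do not state.

```python
nodes = 256                   # aprun -n 256 -N 1

visitors_before = 653_312     # 2014-12-02 run
visitors_now = 1_306_624      # this run
alive_actors_before = 2597    # AliveActorCount per node, earlier runs
alive_actors_now = 5149       # AliveActorCount per node, this run

# Assumption: visitors are distributed evenly across nodes.
visitors_per_node_before = visitors_before // nodes
visitors_per_node_now = visitors_now // nodes

# Actors on each node that are not visitors.
other_actors_before = alive_actors_before - visitors_per_node_before
other_actors_now = alive_actors_now - visitors_per_node_now

print(visitors_per_node_before, visitors_per_node_now)
print(other_actors_before, other_actors_now)
```

Both counts divide evenly by 256, and the number of non-visitor actors per node comes out the same in both runs, so the increase in AliveActorCount is fully accounted for by the extra visitors.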

boisvert@edison12:/project/projectdirs/m1523/Jobs> grep GRAPH spate-Iowa_Native_Prairie_Soil-2014-12-06-20-09-20.stdout
GRAPH -> 176960732906 vertices, 458929863198 vertex observations, and 176102844130 arcs.

irb(main):004:0> 75130880000.0/88480366453
=> 0.8491248738205539

84.9% (that's an improvement!)

boisvert@edison12:/project/projectdirs/m1523/Jobs> getnim -U$(whoami)
m1523 453892.11 ACTV

There are 2 classes of nodes: low load and high load

Low load:

[thorium] node 150 SUMMARY Tick: 1410410797 (10086.400000 Hz)
[thorium] node 150 EPOCH LOAD 10821 s 2.94/22 (0.13) 0.14 0.14 0.13 0.14 0.14 0.14 0.14 0.14 0.14 0.13 0.13 0.13 0.13 0.13 0.13 0.13 0.13 0.15 0.13 0.13 0.13 0.13
[thorium] node 150 EPOCH FUTURE_TIMELINE 10821 s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[thorium] node 150 EPOCH TRAFFIC_REDUCTION 10821 s 0.91 0.91 0.90 0.91 0.92 0.91 0.92 0.92 0.92 0.91 0.92 0.92 0.91 0.89 0.91 0.91 0.91 0.91 0.90 0.92 0.91 0.92
[thorium] node 150 METRICS AliveActorCount: 5149 ByteCount: 18642685952 / 67657900032
[thorium] node 150 TRANSPORT ReceivedMessageCount: 1031647365 SentMessageCount: 2060736450 InboundThroughput: 99115.200000 messages / s OutboundThroughput: 201615.400000 messages / s
[thorium] node 150 MESSAGING BufferedInboundMessageCount: 0 BufferedOutboundMessageCountInRing: 262 BufferedOutboundMessageCountInQueue: 0 MessageCountInTransport: 44

High load:

[thorium] node 250 SUMMARY Tick: 2427301896 (11510.000000 Hz)
[thorium] node 250 EPOCH LOAD 10820 s 5.82/22 (0.26) 0.28 0.28 0.27 0.30 0.29 0.29 0.29 0.27 0.27 0.27 0.27 0.23 0.23 0.24 0.25 0.27 0.27 0.27 0.26 0.24 0.24 0.24
[thorium] node 250 EPOCH FUTURE_TIMELINE 10820 s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[thorium] node 250 EPOCH TRAFFIC_REDUCTION 10820 s 0.88 0.89 0.89 0.89 0.89 0.88 0.88 0.87 0.88 0.88 0.89 0.89 0.89 0.88 0.88 0.89 0.89 0.88 0.88 0.88 0.88 0.88
[thorium] node 250 METRICS AliveActorCount: 5149 ByteCount: 18196049920 / 67657900032
[thorium] node 250 TRANSPORT ReceivedMessageCount: 2670474653 SentMessageCount: 2282128047 InboundThroughput: 258121.500000 messages / s OutboundThroughput: 228574.200000 messages / s
[thorium] node 250 MESSAGING BufferedInboundMessageCount: 0 BufferedOutboundMessageCountInRing: 178 BufferedOutboundMessageCountInQueue: 0 MessageCountInTransport: 35
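Sorting nodes into the two classes by hand does not scale to 256 nodes, so a small script over the EPOCH LOAD lines can help. The sketch below parses the summary part of each line and labels the node by its mean worker load. The 0.20 threshold and the names `parse_load` and `classify` are arbitrary choices for illustration, not anything defined by thorium.

```python
import re

# "[thorium] node N EPOCH LOAD T s total/workers (mean) ..."
LOAD = re.compile(
    r"node (?P<node>\d+) EPOCH LOAD (?P<time>\d+) s "
    r"(?P<total>[\d.]+)/(?P<workers>\d+) \((?P<mean>[\d.]+)\)"
)


def parse_load(line):
    """Return (node, total_load, workers, mean_load), or None if no match."""
    match = LOAD.search(line)
    if match is None:
        return None
    return (int(match.group("node")), float(match.group("total")),
            int(match.group("workers")), float(match.group("mean")))


def classify(mean_load, threshold=0.20):
    """Label a node; the threshold is a hypothetical cut between the classes."""
    return "high" if mean_load >= threshold else "low"


# Summary prefixes of the two LOAD lines above (per-worker values omitted).
lines = [
    "[thorium] node 150 EPOCH LOAD 10821 s 2.94/22 (0.13)",
    "[thorium] node 250 EPOCH LOAD 10820 s 5.82/22 (0.26)",
]

classes = {}
for line in lines:
    node, total, workers, mean = parse_load(line)
    classes[node] = classify(mean)

print(classes)
```

Run over the full output of `grep LOAD`, the same loop would give the size of each class, which could then be compared against how vertices are distributed across nodes.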

sebhtml commented 9 years ago

Goal Test new algorithm that focuses on graph locality

Submission
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-08-19-39-03 (f9e3f0089d1a69e065604631805aaec858079b7a)
2156449.edique02

found a balance issue.

273521ee78ed7981e1fe418b5ff3e16192124330

sebhtml commented 9 years ago

Goal Try with symmetric stride; no more crap in stdout

Submission
./tests/Edison_Cray_XC30/launch-Spate-Iowa-Native-Prairie-Soil.sh
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-11-19-21-57 (2cfec2cd89473a02aaef8951351d96224231be12)
2171684.edique02

TODO