GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License

run Spate on Iowa Native Prairie on Cetus or Mira 1024x16 #832

Open sebhtml opened 9 years ago

sebhtml commented 9 years ago

JobName
Goal: gather runtime information added on 2014-12-01

Machine
AllocationStatus
Path
Commit: d9def37c4b07ae1525f6a45cdd619fe61e9b901f

Toolchain
Script
Submission:
[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Native-Prairie-Soil.sh
Submitted build spate-2014-12-02-01-52-20 (d9def37c4b07ae1525f6a45cdd619fe61e9b901f)
372957

MachineUtilization
ComputationLoad
RunningTime: there is some congestion:

[thorium] node 368 EPOCH TRAFFIC_REDUCTION 3510 s 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
[thorium] node 368 METRICS AliveActorCount: 10261 ByteCount: 10147520512 / 17163091968
[thorium] node 368 MESSAGE_TRANSPORT ReceivedMessageCount: 22866860 SentMessageCount: 21973292 InboundThroughput: 51003.601562 messages / s OutboundThroughput: 51279.601562 messages / s
[thorium] node 368 MESSAGE_QUEUES Tick: 1172035455 BufferedInboundMessageCount: 0 BufferedOutboundMessageCount: 4095 ActiveRequestCount: 20

MemoryUtilization
Checksum
GoodComments
BadComments
NeutralComments
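The METRICS line in this comment reports ByteCount as a pair of numbers. A minimal sketch, assuming the pair means bytes in use followed by bytes available on the node (the log itself does not say so), turns that pair into a fraction. The function name `byte_count_fraction` is hypothetical and not part of biosal or thorium.

```python
import re

def byte_count_fraction(line):
    """Return A / B for the 'ByteCount: A / B' field of a METRICS line.

    Assumption: A is bytes in use and B is bytes available on the node.
    """
    match = re.search(r"ByteCount: (\d+) / (\d+)", line)
    if match is None:
        raise ValueError("no ByteCount field in line")
    used, total = int(match.group(1)), int(match.group(2))
    return used / total

# METRICS line copied from the node 368 report above.
line = ("[thorium] node 368 METRICS AliveActorCount: 10261 "
        "ByteCount: 10147520512 / 17163091968")
fraction = byte_count_fraction(line)
print(f"{fraction:.1%}")
```

Under that assumption, node 368 sits at roughly 59% of its byte budget at this point in the run.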

sebhtml commented 9 years ago

JobName
Goal
Machine
AllocationStatus
Path
Commit: 7eded661c1321366f9b5e5aaf11c45d9c8c18c77

Toolchain
Script
Submission:
tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Native-Prairie-Soil.sh
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-02-23-17-45 (7eded661c1321366f9b5e5aaf11c45d9c8c18c77)
373501

MachineUtilization
ComputationLoad:
[thorium] node 926 EPOCH LOAD 3530 s 2.78/15 (0.19) 0.19 0.18 0.19 0.18 0.18 0.19 0.18 0.19 0.19 0.18 0.18 0.18 0.19 0.19 0.18

RunningTime:
[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-02-23-17-45.output
TIMER [Load input / Count input data] 1 minutes, 0.151436 seconds
TIMER [Load input / Distribute input data] 2 minutes, 24.925156 seconds
TIMER [Load input] 3 minutes, 25.076584 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 42.108521 seconds
TIMER [Build assembly graph / Distribute arcs] 29 minutes, 44.538696 seconds
TIMER [Build assembly graph] 42 minutes, 26.647217 seconds

MemoryUtilization
Checksum
GoodComments:
[thorium] node 502 EPOCH TRAFFIC_REDUCTION 3530 s 0.73 0.73 0.73 0.74 0.72 0.73 0.73 0.74 0.73 0.73 0.73 0.73 0.73 0.72 0.74

BadComments
NeutralComments
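The EPOCH LOAD line in this comment can be read mechanically. The sketch below assumes that `2.78/15 (0.19)` is a total load over 15 entries followed by its mean, and that the trailing numbers are the per-entry loads; that reading is inferred from the numbers, not from thorium documentation. `parse_epoch_load` is a hypothetical helper.

```python
import re

# Assumed layout: EPOCH LOAD <epoch> s <total>/<count> (<mean>) <value> <value> ...
LOAD_RE = re.compile(
    r"EPOCH LOAD (?P<epoch>\d+) s "
    r"(?P<total>[\d.]+)/(?P<count>\d+) \((?P<mean>[\d.]+)\)"
    r"(?P<values>(?: [\d.]+)+)"
)

def parse_epoch_load(line):
    """Split an EPOCH LOAD line into its reported summary and listed values."""
    match = LOAD_RE.search(line)
    if match is None:
        raise ValueError("not an EPOCH LOAD line")
    return {
        "epoch_seconds": int(match.group("epoch")),
        "reported_total": float(match.group("total")),
        "entry_count": int(match.group("count")),
        "reported_mean": float(match.group("mean")),
        "values": [float(v) for v in match.group("values").split()],
    }

# Line copied from the node 926 report above.
line = ("[thorium] node 926 EPOCH LOAD 3530 s 2.78/15 (0.19) "
        "0.19 0.18 0.19 0.18 0.18 0.19 0.18 0.19 0.19 0.18 0.18 0.18 0.19 0.19 0.18")
load = parse_epoch_load(line)
recomputed_mean = sum(load["values"]) / len(load["values"])
print(load["entry_count"], round(recomputed_mean, 3))
```

The recomputed mean agrees with the reported one to within rounding, which is consistent with the assumed layout. Either way, the entries are busy less than a fifth of the time.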

sebhtml commented 9 years ago

JobName
Goal: Try with an increased buffer size for multiplexer

Machine: Cetus

AllocationStatus
Path
Commit: 27121c73fc7292ab9ed4d3ec24ccee547972f12f

Toolchain
Script
Submission:
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-03-23-15-31 (27121c73fc7292ab9ed4d3ec24ccee547972f12f)
374417

MachineUtilization
ComputationLoad
RunningTime:
[thorium] node 109 EPOCH LOAD 3520 s 2.68/15 (0.18) 0.18 0.18 0.17 0.18 0.17 0.18 0.18 0.17 0.18 0.18 0.17 0.18 0.18 0.18 0.18
[thorium] node 109 EPOCH FUTURE_TIMELINE 3520 s 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
[thorium] node 109 EPOCH WAKE_UP_COUNT 3520 s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[thorium] node 109 EPOCH TRAFFIC_REDUCTION 3520 s 0.74 0.73 0.74 0.73 0.74 0.74 0.73 0.73 0.74 0.73 0.74 0.74 0.74 0.74 0.74
[thorium] node 109 METRICS AliveActorCount: 10261 ByteCount: 10132054016 / 17163091968
[thorium] node 109 MESSAGE_TRANSPORT ReceivedMessageCount: 23275544 SentMessageCount: 22215684 InboundThroughput: 47606.000000 messages / s OutboundThroughput: 47299.800000 messages / s
[thorium] node 109 MESSAGE_QUEUES Tick: 1167086013 BufferedInboundMessageCount: 0 BufferedOutboundMessageCount: 4095 ActiveRequestCount: 20

MemoryUtilization
Checksum
GoodComments
BadComments: BufferedOutboundMessageCount is too high. It needs more multiplexing.

NeutralComments
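The complaint about BufferedOutboundMessageCount can be checked automatically across nodes. A minimal sketch follows, assuming the MESSAGE_QUEUES line always lists `Name: integer` pairs as in the report above. The threshold `OUTBOUND_THRESHOLD` is a made-up value for illustration, not a thorium setting, and `parse_queue_line` is a hypothetical helper.

```python
import re

# Hypothetical cut-off for "too high"; the thread does not give one.
OUTBOUND_THRESHOLD = 1000

def parse_queue_line(line):
    """Return the 'Name: integer' pairs of a MESSAGE_QUEUES line as a dict."""
    if "MESSAGE_QUEUES" not in line:
        raise ValueError("not a MESSAGE_QUEUES line")
    return {name: int(value) for name, value in re.findall(r"(\w+): (\d+)", line)}

# Line copied from the node 109 report above.
line = ("[thorium] node 109 MESSAGE_QUEUES Tick: 1167086013 "
        "BufferedInboundMessageCount: 0 BufferedOutboundMessageCount: 4095 "
        "ActiveRequestCount: 20")
fields = parse_queue_line(line)
congested = fields["BufferedOutboundMessageCount"] >= OUTBOUND_THRESHOLD
print(fields["BufferedOutboundMessageCount"], congested)
```

Run over every `[thorium] node N MESSAGE_QUEUES` line in an output file, this would show whether the outbound backlog is local to a few nodes or present everywhere.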

sebhtml commented 9 years ago

Goal: Try with more actors, a bigger ideal buffer size, and a lower congestion threshold

Script
[boisvert@miralac1 automated-tests]$ cat spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01.sh

#!/bin/bash

echo "Commit= 0d480b7508a034e8480faef84ea7e27394dcb483"

qsub \
 --env PAMID_THREAD_MULTIPLE=1 \
 -A CompBIO \
 -n 512 \
 -t 04:00:00 \
 -O spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01 \
 --mode c1 \
 spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01.spate -print-thorium-data -threads-per-node 16 \
 -k 27 Iowa_Native_Prairie_Soil/*.fastq -print-thorium-data \
 -o spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01

Submission
[boisvert@miralac1 automated-tests]$ ./spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01.sh
Project 'compbio'; job rerouted to queue 'prod-short'
375708

Deleted job because the Edison job for the same code started first. I got the info that I needed.

sebhtml commented 9 years ago

Goal: Test new congestion detection and test message caching approach

Commit: 1d747471a765979d39cf119cef2a487e00c348df

Script
[boisvert@miralac1 automated-tests]$ cat spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.sh

#!/bin/bash

echo "Commit= 1d747471a765979d39cf119cef2a487e00c348df"

qsub \
 --env PAMID_THREAD_MULTIPLE=1 \
 -A CompBIO \
 -n 1024 \
 -t 04:00:00 \
 -O spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07 \
 --mode c1 \
 spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.spate -print-thorium-data -threads-per-node 16 \
 -k 27 Iowa_Native_Prairie_Soil/*.fastq -print-thorium-data \
 -o spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07

Submission
[boisvert@miralac1 automated-tests]$ ./spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.sh
Project 'compbio'; job rerouted to queue 'prod-short'
376958

[boisvert@miralac1 automated-tests]$ grep visitors spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.output
DEBUG the system has 20966400 visitors

[thorium] node 8 METRICS AliveActorCount: 20506 ByteCount: 15023996928 / 17163091968

[boisvert@miralac1 automated-tests]$ grep GRAPH spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.output
GRAPH -> 176960732906 vertices, 458929863198 vertex observations, and 168099207842 arcs.
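The three GRAPH counts are easier to compare as ratios. The arithmetic below uses only the numbers from the GRAPH line. Reading "vertex observations" as the number of times the k-mers behind the vertices were seen in the input is an assumption.

```python
# Counts copied from the GRAPH line above.
vertices = 176_960_732_906
vertex_observations = 458_929_863_198
arcs = 168_099_207_842

# Mean number of observations per vertex (assumed: mean k-mer multiplicity).
observations_per_vertex = vertex_observations / vertices

# Mean number of arcs per vertex.
arcs_per_vertex = arcs / vertices

print(f"{observations_per_vertex:.2f} observations per vertex")
print(f"{arcs_per_vertex:.2f} arcs per vertex")
```

That works out to about 2.59 observations and about 0.95 arcs per vertex.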

Time
[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.output
TIMER [Load input / Count input data] 56.471962 seconds
TIMER [Load input / Distribute input data] 2 minutes, 26.991119 seconds
TIMER [Load input] 3 minutes, 23.463074 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 39.585815 seconds
TIMER [Build assembly graph / Distribute arcs] 29 minutes, 14.582764 seconds
TIMER [Build assembly graph] 41 minutes, 54.168457 seconds
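Comparing runs is simpler once the TIMER entries are converted to plain seconds. The sketch below handles the two shapes that appear in this thread: `N minutes, S seconds`, and `S seconds` alone when a step takes under a minute. The optional hours part is an assumption, since no run here prints one. `parse_timers` is a hypothetical helper, not part of Spate.

```python
import re

# Matches e.g. "TIMER [Load input] 3 minutes, 23.463074 seconds"
# and "TIMER [Load input / Count input data] 56.471962 seconds".
# The "hours" group is an assumption; it never appears in this thread.
TIMER_RE = re.compile(
    r"TIMER \[(?P<name>[^\]]+)\] "
    r"(?:(?P<hours>\d+) hours?, )?"
    r"(?:(?P<minutes>\d+) minutes?, )?"
    r"(?P<seconds>\d+(?:\.\d+)?) seconds"
)

def parse_timers(text):
    """Return {timer name: elapsed seconds} for every TIMER entry in text."""
    timers = {}
    for match in TIMER_RE.finditer(text):
        total = float(match.group("seconds"))
        total += 60 * int(match.group("minutes") or 0)
        total += 3600 * int(match.group("hours") or 0)
        timers[match.group("name")] = total
    return timers

# Two entries copied from the 2014-12-07 run above.
sample = (
    "TIMER [Load input / Count input data] 56.471962 seconds\n"
    "TIMER [Build assembly graph] 41 minutes, 54.168457 seconds\n"
)
timers = parse_timers(sample)
for name, seconds in timers.items():
    print(f"{name}: {seconds:.1f} s")
```

Because the pattern does not depend on line breaks, it also works on the output of `grep TIMER` pasted as a single line.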

sebhtml commented 9 years ago

Goal: Try with 2 in-flight messages + 2 KiB buffer size

Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-09-19-15-01 (83a85f301e9db516b9049becf21dc57bb66b19e6)
378726

[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-09-19-15-01.output
TIMER [Load input / Count input data] 59.498131 seconds
TIMER [Load input / Distribute input data] 2 minutes, 22.637009 seconds
TIMER [Load input] 3 minutes, 22.135132 seconds
TIMER [Build assembly graph / Distribute vertices] 10 minutes, 56.000977 seconds
TIMER [Build assembly graph / Distribute arcs] 19 minutes, 21.779907 seconds
TIMER [Build assembly graph] 30 minutes, 17.780762 seconds

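Summary: It is faster. [Build assembly graph] took 30 minutes, 17.780762 seconds, down from 41 minutes, 54.168457 seconds in the 2014-12-07 run.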
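To put a number on the improvement, the sketch below compares the 2014-12-07 and 2014-12-09 TIMER values for the two slowest entries. The timings are copied from the two runs; the helper `to_seconds` and the dictionary layout are only for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'M minutes, S seconds' TIMER value to seconds."""
    return 60 * minutes + seconds

# 2014-12-07 run (commit 1d747471...).
before = {
    "Build assembly graph / Distribute arcs": to_seconds(29, 14.582764),
    "Build assembly graph": to_seconds(41, 54.168457),
}
# 2014-12-09 run (commit 83a85f30...), 2 in-flight messages + 2 KiB buffers.
after = {
    "Build assembly graph / Distribute arcs": to_seconds(19, 21.779907),
    "Build assembly graph": to_seconds(30, 17.780762),
}

# Speedup > 1 means the newer run is faster.
speedups = {name: before[name] / after[name] for name in before}
for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```

Arc distribution is about 1.51 times faster and graph construction as a whole about 1.38 times faster, so most of the gain comes from the arc phase.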

sebhtml commented 9 years ago

What's new?
buffer size: 1 KiB
limit: 4
no -print-thorium-data

Submission
tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Native-Prairie-Soil.sh
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41 (24dc7a307905b4896e0d07b1674d50fd0080bcb3)
379527

Result

[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41.output
TIMER [Load input / Count input data] 56.779697 seconds
TIMER [Load input / Distribute input data] 2 minutes, 20.801941 seconds
TIMER [Load input] 3 minutes, 17.581635 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 5.712769 seconds

SIGABRT

There is a bug in pamid in mpich2 in the bgq-driver V1R2M2:


Program : /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41.spate


+++ID Rank: 34, TGID: 1, Core: 13, HWTID:3 TID: 355 State: RUN

0000000001548678 abort /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/stdlib/abort.c:77

0000000001542788 __assert_fail /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/assert/assert.c:81

00000000011dee94 MPIDI_RecvShortAsyncCB /bgsys/source/srcV1R2M2.3650/comm/lib/dev/mpich2/src/mpid/pamid/src/pt2pt/mpidi_callback_short.c:117

0000000001289fe0 00001736.long_branch_r2off.snprintf+0 /bgsys/source/srcV1R2M2.1830/comm/sys/buildtools/pami/p2p/protocols/send/eager/EagerSimple_packed_impl.h:324

000000000147808c _ZN4PAMI6Device2MU10RecChannel7advanceEv /bgsys/source/srcV1R2M2.1830/comm/sys/buildtools/pami/components/devices/bgq/mu2/RecChannel.h:411

0000000001479be4 00001736.long_branch_r2off.snprintf+0 /bgsys/source/srcV1R2M2.1830/comm/sys/buildtools/pami/components/devices/bgq/commthread/CommThreadWakeup.h:491

000000000147aed0 00001736.long_branch_r2off.snprintf+0 /bgsys/source/srcV1R2M2.1830/comm/sys/buildtools/pami/components/devices/bgq/commthread/CommThreadWakeup.h:243

000000000107f580 start_thread /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322

000000000159018c 00007d32.long_branch_r2off.sprintf+0 :0

Ticket alcf-support #242529

sebhtml commented 9 years ago

I will wait for an answer from ALCF. We will probably remove PAMID_THREAD_MULTIPLE, since it appears to be buggy.

sebhtml commented 9 years ago

Goal: Generate core file

Script
[boisvert@miralac1 automated-tests]$ cat spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2.sh

#!/bin/bash

echo "Commit= 24dc7a307905b4896e0d07b1674d50fd0080bcb3"

# -print-thorium-data

qsub \
 --env PAMID_THREAD_MULTIPLE=1 \
 --env BGCOREDUMPBINARY="" \
 -A CompBIO \
 -n 1024 \
 -t 01:00:00 \
 -O spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2 \
 --mode c1 \
 spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2.spate -threads-per-node 16 \
 -k 27 Iowa_Native_Prairie_Soil/*.fastq \
 -o spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2

Submission
[boisvert@miralac1 automated-tests]$ ./spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2.sh
Project 'compbio'; job rerouted to queue 'prod-short'
380117

Result: Not reproducible...

[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2.output
TIMER [Load input / Count input data] 58.122776 seconds
TIMER [Load input / Distribute input data] 2 minutes, 19.629807 seconds
TIMER [Load input] 3 minutes, 17.752594 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 6.385620 seconds
TIMER [Build assembly graph / Distribute arcs] 19 minutes, 16.083496 seconds
TIMER [Build assembly graph] 31 minutes, 22.469116 seconds

sebhtml commented 9 years ago

Goal: Try without PAMID_THREAD_MULTIPLE

Not yet.

sebhtml commented 9 years ago

Changes
no more crap in stdout
symmetric stride

Submission
tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Native-Prairie-Soil.sh
Project 'compbio'; job rerouted to queue 'prod-short'
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-12-03-22-05 (2cfec2cd89473a02aaef8951351d96224231be12)
380216

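This did not help. [Build assembly graph / Distribute arcs] took 34 minutes, 39 seconds, up from about 19 minutes in the two previous completed runs.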

[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-12-03-22-05.error
TIMER [Load input / Count input data] 49.340870 seconds
TIMER [Load input / Distribute input data] 2 minutes, 17.125793 seconds
TIMER [Load input] 3 minutes, 6.466660 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 24.286377 seconds
TIMER [Build assembly graph / Distribute arcs] 34 minutes, 39.112549 seconds
TIMER [Build assembly graph] 47 minutes, 3.398926 seconds