Open sebhtml opened 9 years ago
JobName Goal Machine AllocationStatus Path Commit 7eded661c1321366f9b5e5aaf11c45d9c8c18c77
Toolchain Script Submission tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Native-Prairie-Soil.sh Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-02-23-17-45 (7eded661c1321366f9b5e5aaf11c45d9c8c18c77) 373501
MachineUtilization ComputationLoad [thorium] node 926 EPOCH LOAD 3530 s 2.78/15 (0.19) 0.19 0.18 0.19 0.18 0.18 0.19 0.18 0.19 0.19 0.18 0.18 0.18 0.19 0.19 0.18
RunningTime
[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-02-23-17-45.output
TIMER [Load input / Count input data] 1 minutes, 0.151436 seconds
TIMER [Load input / Distribute input data] 2 minutes, 24.925156 seconds
TIMER [Load input] 3 minutes, 25.076584 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 42.108521 seconds
TIMER [Build assembly graph / Distribute arcs] 29 minutes, 44.538696 seconds
TIMER [Build assembly graph] 42 minutes, 26.647217 seconds
MemoryUtilization Checksum GoodComments [thorium] node 502 EPOCH TRAFFIC_REDUCTION 3530 s 0.73 0.73 0.73 0.74 0.72 0.73 0.73 0.74 0.73 0.73 0.73 0.73 0.73 0.72 0.74
BadComments NeutralComments
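The EPOCH LOAD line above can be checked mechanically. The following is a minimal sketch that assumes the layout `<sum>/<worker count> (<mean>)` followed by one load value per worker; that reading is inferred from the numbers in this log, not taken from the thorium documentation, and `parse_load` is a hypothetical helper name.

```python
import re

# Assumed layout of a thorium EPOCH LOAD line, inferred from the log:
#   [thorium] node <id> EPOCH LOAD <t> s <sum>/<workers> (<mean>) <per-worker loads...>
LOAD_RE = re.compile(
    r"\[thorium\] node (\d+) EPOCH LOAD (\d+) s "
    r"([\d.]+)/(\d+) \(([\d.]+)\)((?: [\d.]+)*)"
)

def parse_load(line):
    """Return a dict with the fields of one EPOCH LOAD line, or None."""
    m = LOAD_RE.search(line)
    if m is None:
        return None
    return {
        "node": int(m.group(1)),
        "time_s": int(m.group(2)),
        "total": float(m.group(3)),
        "workers": int(m.group(4)),
        "mean": float(m.group(5)),
        "per_worker": [float(x) for x in m.group(6).split()],
    }

line = ("[thorium] node 926 EPOCH LOAD 3530 s 2.78/15 (0.19) "
        "0.19 0.18 0.19 0.18 0.18 0.19 0.18 0.19 0.19 0.18 0.18 0.18 0.19 0.19 0.18")
load = parse_load(line)

# Under the assumed layout, total / workers should round to the reported mean.
print(load["workers"], round(load["total"] / load["workers"], 2))
```

If that reading is right, each of the 15 workers on this node is busy less than 20% of the time, which fits the suspicion that the run is limited by message traffic rather than by computation.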
JobName Goal Try with an increased buffer size for multiplexer
Machine Cetus
AllocationStatus Path Commit 27121c73fc7292ab9ed4d3ec24ccee547972f12f
Toolchain Script Submission Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-03-23-15-31 (27121c73fc7292ab9ed4d3ec24ccee547972f12f) 374417
MachineUtilization ComputationLoad RunningTime
[thorium] node 109 EPOCH LOAD 3520 s 2.68/15 (0.18) 0.18 0.18 0.17 0.18 0.17 0.18 0.18 0.17 0.18 0.18 0.17 0.18 0.18 0.18 0.18
[thorium] node 109 EPOCH FUTURE_TIMELINE 3520 s 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
[thorium] node 109 EPOCH WAKE_UP_COUNT 3520 s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[thorium] node 109 EPOCH TRAFFIC_REDUCTION 3520 s 0.74 0.73 0.74 0.73 0.74 0.74 0.73 0.73 0.74 0.73 0.74 0.74 0.74 0.74 0.74
[thorium] node 109 METRICS AliveActorCount: 10261 ByteCount: 10132054016 / 17163091968
[thorium] node 109 MESSAGE_TRANSPORT ReceivedMessageCount: 23275544 SentMessageCount: 22215684 InboundThroughput: 47606.000000 messages / s OutboundThroughput: 47299.800000 messages / s
[thorium] node 109 MESSAGE_QUEUES Tick: 1167086013 BufferedInboundMessageCount: 0 BufferedOutboundMessageCount: 4095 ActiveRequestCount: 20
MemoryUtilization Checksum GoodComments BadComments BufferedOutboundMessageCount is too high (4095). More multiplexing is needed.
NeutralComments
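To spot this kind of outbound backlog across many nodes, the MESSAGE_QUEUES lines can be scanned automatically. This is a minimal sketch: `parse_counters` and `OUTBOUND_LIMIT` are hypothetical names, and the threshold of 4095 is only the value observed in this log, not a documented queue capacity.

```python
import re

def parse_counters(line):
    """Extract every 'Name: integer' pair from a thorium status line.

    Only the integer part is kept, so this suits counters such as
    BufferedOutboundMessageCount, not the floating-point throughput fields.
    """
    return {name: int(value) for name, value in re.findall(r"(\w+): (\d+)", line)}

line = ("[thorium] node 109 MESSAGE_QUEUES Tick: 1167086013 "
        "BufferedInboundMessageCount: 0 BufferedOutboundMessageCount: 4095 "
        "ActiveRequestCount: 20")
counters = parse_counters(line)

# Hypothetical threshold: 4095 (2**12 - 1) looks like a full queue, but the
# actual capacity is not stated anywhere in the log.
OUTBOUND_LIMIT = 4095
congested = counters["BufferedOutboundMessageCount"] >= OUTBOUND_LIMIT
print(counters["BufferedOutboundMessageCount"], congested)
```

Running the same check over the output of `grep MESSAGE_QUEUES` would show whether the backlog affects one node or all of them.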
Goal Try with more actors, bigger ideal buffer size, and lower congestion threshold
Script [boisvert@miralac1 automated-tests]$ cat spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01.sh
qsub \
 --env PAMID_THREAD_MULTIPLE=1 \
 -A CompBIO \
 -n 512 \
 -t 04:00:00 \
 -O spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01 \
 --mode c1 \
 spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01.spate -print-thorium-data -threads-per-node 16 \
 -k 27 Iowa_Native_Prairie_Soil/*.fastq -print-thorium-data \
 -o spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01
Submission [boisvert@miralac1 automated-tests]$ ./spate-Iowa_Native_Prairie_Soil-2014-12-04-20-40-01.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 375708
Deleted job because the Edison job for the same code started first. I got the info that I needed.
Goal Test new congestion detection and test message caching approach
Commit 1d747471a765979d39cf119cef2a487e00c348df
Script [boisvert@miralac1 automated-tests]$ cat spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.sh
qsub \
 --env PAMID_THREAD_MULTIPLE=1 \
 -A CompBIO \
 -n 1024 \
 -t 04:00:00 \
 -O spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07 \
 --mode c1 \
 spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.spate -print-thorium-data -threads-per-node 16 \
 -k 27 Iowa_Native_Prairie_Soil/*.fastq -print-thorium-data \
 -o spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07
Submission [boisvert@miralac1 automated-tests]$ ./spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 376958
[boisvert@miralac1 automated-tests]$ grep visitors spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.output
DEBUG the system has 20966400 visitors
[thorium] node 8 METRICS AliveActorCount: 20506 ByteCount: 15023996928 / 17163091968
[boisvert@miralac1 automated-tests]$ grep GRAPH spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.output
GRAPH -> 176960732906 vertices, 458929863198 vertex observations, and 168099207842 arcs.
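The GRAPH line gives raw counts only. A short sketch like the one below derives two ratios from it; the variable names are illustrative, and reading "vertex observations divided by vertices" as a mean coverage per vertex is an assumption about what spate counts.

```python
import re

line = ("GRAPH -> 176960732906 vertices, 458929863198 vertex observations, "
        "and 168099207842 arcs.")

m = re.search(
    r"GRAPH -> (\d+) vertices, (\d+) vertex observations, and (\d+) arcs", line
)
vertices, observations, arcs = (int(g) for g in m.groups())

# Assumed interpretation: mean number of times each vertex was observed.
coverage = observations / vertices
# Mean number of arcs per vertex.
arcs_per_vertex = arcs / vertices

print(f"{coverage:.2f} observations per vertex, "
      f"{arcs_per_vertex:.2f} arcs per vertex")
```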
Time
[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-07-04-09-07.output
TIMER [Load input / Count input data] 56.471962 seconds
TIMER [Load input / Distribute input data] 2 minutes, 26.991119 seconds
TIMER [Load input] 3 minutes, 23.463074 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 39.585815 seconds
TIMER [Build assembly graph / Distribute arcs] 29 minutes, 14.582764 seconds
TIMER [Build assembly graph] 41 minutes, 54.168457 seconds
Goal Try with 2 in-flight messages + 2 KiB buffer size
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-09-19-15-01 (83a85f301e9db516b9049becf21dc57bb66b19e6) 378726
[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-09-19-15-01.output
TIMER [Load input / Count input data] 59.498131 seconds
TIMER [Load input / Distribute input data] 2 minutes, 22.637009 seconds
TIMER [Load input] 3 minutes, 22.135132 seconds
TIMER [Build assembly graph / Distribute vertices] 10 minutes, 56.000977 seconds
TIMER [Build assembly graph / Distribute arcs] 19 minutes, 21.779907 seconds
TIMER [Build assembly graph] 30 minutes, 17.780762 seconds
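Summary It is faster: [Build assembly graph] took 30 minutes, 17.8 seconds, down from 41 minutes, 54.2 seconds in the 2014-12-07 run.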
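Comparing runs by eye gets tedious, so the TIMER lines can be converted to seconds and compared step by step. This is a minimal sketch; `timer_seconds` is a hypothetical helper, and support for an "hours" unit is an assumption since no run in this log lasts that long.

```python
import re

UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}

def timer_seconds(line):
    """Convert 'TIMER [label] M minutes, S seconds' to (label, seconds)."""
    m = re.match(r"TIMER \[(.+?)\] (.*)", line.strip())
    if m is None:
        return None
    label, rest = m.groups()
    total = 0.0
    for value, unit in re.findall(r"([\d.]+) (hours|minutes|seconds)", rest):
        total += float(value) * UNIT_SECONDS[unit]
    return label, total

# Run 2014-12-07-04-09-07
before = dict(map(timer_seconds, [
    "TIMER [Build assembly graph / Distribute vertices] 12 minutes, 39.585815 seconds",
    "TIMER [Build assembly graph / Distribute arcs] 29 minutes, 14.582764 seconds",
    "TIMER [Build assembly graph] 41 minutes, 54.168457 seconds",
]))
# Run 2014-12-09-19-15-01 (2 in-flight messages + 2 KiB buffer size)
after = dict(map(timer_seconds, [
    "TIMER [Build assembly graph / Distribute vertices] 10 minutes, 56.000977 seconds",
    "TIMER [Build assembly graph / Distribute arcs] 19 minutes, 21.779907 seconds",
    "TIMER [Build assembly graph] 30 minutes, 17.780762 seconds",
]))

for label in before:
    ratio = before[label] / after[label]
    print(f"{label}: {before[label]:.1f} s -> {after[label]:.1f} s ({ratio:.2f}x)")
```

Most of the gain comes from the Distribute arcs step, which is the one that sends the most messages.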
What's new?
buffer size: 1 KiB
limit: 4
no -print-thorium-data
Submission tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Native-Prairie-Soil.sh
Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41 (24dc7a307905b4896e0d07b1674d50fd0080bcb3) 379527
Result
[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41.output
TIMER [Load input / Count input data] 56.779697 seconds
TIMER [Load input / Distribute input data] 2 minutes, 20.801941 seconds
TIMER [Load input] 3 minutes, 17.581635 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 5.712769 seconds
SIGABRT
There is a bug in pamid in mpich2 in the bgq-driver V1R2M2:
Program : /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41.spate
+++ID Rank: 34, TGID: 1, Core: 13, HWTID:3 TID: 355 State: RUN
0000000001548678 abort /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/stdlib/abort.c:77
0000000001542788 __assert_fail /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/assert/assert.c:81
00000000011dee94 MPIDI_RecvShortAsyncCB /bgsys/source/srcV1R2M2.3650/comm/lib/dev/mpich2/src/mpid/pamid/src/pt2pt/mpidi_callback_short.c:117
0000000001289fe0 00001736.long_branch_r2off.snprintf+0 /bgsys/source/srcV1R2M2.1830/comm/sys/buildtools/pami/p2p/protocols/send/eager/EagerSimple_packed_impl.h:324
000000000147808c _ZN4PAMI6Device2MU10RecChannel7advanceEv /bgsys/source/srcV1R2M2.1830/comm/sys/buildtools/pami/components/devices/bgq/mu2/RecChannel.h:411
0000000001479be4 00001736.long_branch_r2off.snprintf+0 /bgsys/source/srcV1R2M2.1830/comm/sys/buildtools/pami/components/devices/bgq/commthread/CommThreadWakeup.h:491
000000000147aed0 00001736.long_branch_r2off.snprintf+0 /bgsys/source/srcV1R2M2.1830/comm/sys/buildtools/pami/components/devices/bgq/commthread/CommThreadWakeup.h:243
000000000107f580 start_thread /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322
000000000159018c 00007d32.long_branch_r2off.sprintf+0 :0
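To pull the failing assertion out of a backtrace like this one, each frame can be split into address, symbol, file and line, and the caller of `__assert_fail` reported. This is a minimal sketch; the frame layout is inferred from the dump above, and `parse_frames` and `assertion_site` are hypothetical helpers.

```python
import re

# Assumed frame layout: <16 hex digits> <symbol> <path>:<line>
FRAME_RE = re.compile(r"^([0-9a-f]{16}) (\S+) (\S*):(\d+)$")

def parse_frames(text):
    """Return a list of (address, symbol, path, line) tuples, innermost first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            address, symbol, path, lineno = m.groups()
            frames.append((int(address, 16), symbol, path, int(lineno)))
    return frames

def assertion_site(frames):
    """Return the frame that called __assert_fail, or None."""
    for caller, callee in zip(frames[1:], frames):
        if callee[1] == "__assert_fail":
            return caller
    return None

trace = """
0000000001548678 abort /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/stdlib/abort.c:77
0000000001542788 __assert_fail /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/assert/assert.c:81
00000000011dee94 MPIDI_RecvShortAsyncCB /bgsys/source/srcV1R2M2.3650/comm/lib/dev/mpich2/src/mpid/pamid/src/pt2pt/mpidi_callback_short.c:117
000000000159018c 00007d32.long_branch_r2off.sprintf+0 :0
"""

frames = parse_frames(trace)
site = assertion_site(frames)
print(site[1], site[2].rsplit("/", 1)[-1], site[3])
```

For this dump the assertion site is in `MPIDI_RecvShortAsyncCB`, that is, inside the pamid layer of MPICH2 and not in spate or thorium code.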
Ticket alcf-support #242529
I will wait for an answer from ALCF. We will probably stop using PAMID_THREAD_MULTIPLE, since it appears to be buggy.
Goal Generate core file
Script [boisvert@miralac1 automated-tests]$ cat spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2.sh
qsub \
 --env PAMID_THREAD_MULTIPLE=1 \
 --env BGCOREDUMPBINARY="" \
 -A CompBIO \
 -n 1024 \
 -t 01:00:00 \
 -O spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2 \
 --mode c1 \
 spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2.spate -threads-per-node 16 \
 -k 27 Iowa_Native_Prairie_Soil/*.fastq \
 -o spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2
Submission [boisvert@miralac1 automated-tests]$ ./spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 380117
Result Not reproducible...
[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-10-20-26-41-2.output
TIMER [Load input / Count input data] 58.122776 seconds
TIMER [Load input / Distribute input data] 2 minutes, 19.629807 seconds
TIMER [Load input] 3 minutes, 17.752594 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 6.385620 seconds
TIMER [Build assembly graph / Distribute arcs] 19 minutes, 16.083496 seconds
TIMER [Build assembly graph] 31 minutes, 22.469116 seconds
Goal Try without PAMID_THREAD_MULTIPLE
not yet
Changes no more crap in stdout; symmetric stride
Submission tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Native-Prairie-Soil.sh \ Project 'compbio'; job rerouted to queue 'prod-short' Submitted build spate-Iowa_Native_Prairie_Soil-2014-12-12-03-22-05 (2cfec2cd89473a02aaef8951351d96224231be12) 380216
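This did not help: [Build assembly graph / Distribute arcs] took 34 minutes, 39.1 seconds, compared with 19 minutes, 16.1 seconds in the previous run.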
[boisvert@miralac1 automated-tests]$ grep TIMER spate-Iowa_Native_Prairie_Soil-2014-12-12-03-22-05.error
TIMER [Load input / Count input data] 49.340870 seconds
TIMER [Load input / Distribute input data] 2 minutes, 17.125793 seconds
TIMER [Load input] 3 minutes, 6.466660 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 24.286377 seconds
TIMER [Build assembly graph / Distribute arcs] 34 minutes, 39.112549 seconds
TIMER [Build assembly graph] 47 minutes, 3.398926 seconds
JobName Goal Gather the runtime information added on 2014-12-01
Machine AllocationStatus Path Commit d9def37c4b07ae1525f6a45cdd619fe61e9b901f
Toolchain Script Submission [boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Native-Prairie-Soil.sh Submitted build spate-2014-12-02-01-52-20 (d9def37c4b07ae1525f6a45cdd619fe61e9b901f) 372957
MachineUtilization ComputationLoad RunningTime there is some congestion:
[thorium] node 368 EPOCH TRAFFIC_REDUCTION 3510 s 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
[thorium] node 368 METRICS AliveActorCount: 10261 ByteCount: 10147520512 / 17163091968
[thorium] node 368 MESSAGE_TRANSPORT ReceivedMessageCount: 22866860 SentMessageCount: 21973292 InboundThroughput: 51003.601562 messages / s OutboundThroughput: 51279.601562 messages / s
[thorium] node 368 MESSAGE_QUEUES Tick: 1172035455 BufferedInboundMessageCount: 0 BufferedOutboundMessageCount: 4095 ActiveRequestCount: 20
MemoryUtilization Checksum GoodComments BadComments NeutralComments