Closed sebhtml closed 9 years ago
On Cetus, it seems that not all graph stores received the message:
[boisvert@cetuslac1 automated-tests]$ grep complexity spate-2014-10-26-03-59-27.output
debug biosal_assembly_graph_builder_control_complexity 15360 graph stores
debug biosal_assembly_graph_builder_control_complexity 15360 graph stores
[boisvert@cetuslac1 automated-tests]$ grep "graph_Store receives " spate-2014-10-26-03-59-27.output|wc -l
29271
[boisvert@cetuslac1 automated-tests]$ grep actual spate-2014-10-26-03-59-27.output
graph store synchronization: actual_kmer_count 102319934766 total_kmer_count 126844607728
[boisvert@cetuslac1 automated-tests]$ grep synchronized spate-2014-10-26-03-59-27.output|tail -n 1
synchronized_graph_stores 13911/15360
irb(main):001:0> 15360+13911
=> 29271
JobName
Goal
Machine Beagle
AllocationStatus
Path
Commit
Toolchain
Script
Submission Submitted build spate-2014-10-26-15-44-30 (1c94a2b173b4c4656cfc1f22dbc8a0cfb287b7af) 2889860.sdb
MachineUtilization
ComputationLoad
RunningTime
Beagle) grep TIMER spate-2014-10-26-15-44-30.stdout
TIMER [Load input / Count input data] 3 minutes, 37.580170 seconds
TIMER [Load input / Distribute input data] 3 minutes, 13.564377 seconds
TIMER [Load input] 6 minutes, 51.144531 seconds
TIMER [Build assembly graph / Distribute vertices] 3 minutes, 55.399902 secondscore_manager/1020925 dies
TIMER [Build assembly graph / Distribute arcs] 7 minutes, 11.654602 seconds
TIMER [Build assembly graph] 11 minutes, 7.054504 seconds
MemoryUtilization
Checksum
Beagle) sha1sum spate-2014-10-26-15-44-30/coverage_distribution.txt-canonical
01a293db48518190038eaddbaed8a47ca0323fc7 spate-2014-10-26-15-44-30/coverage_distribution.txt-canonical
GoodComments
BadComments
NeutralComments
JobName
Goal
Machine Cetus
AllocationStatus
Path
Commit
Toolchain
Script
Submission Submitted build spate-2014-10-26-20-44-36 (1c94a2b173b4c4656cfc1f22dbc8a0cfb287b7af) 352071
MachineUtilization
ComputationLoad
RunningTime FAILED
MemoryUtilization
Checksum
GoodComments
BadComments
synchronized_graph_stores 15360/15360 126844601886/126844607728
graph store synchronization: actual_kmer_count 126844601886 total_kmer_count 126844607728
got mismatch, will try again
irb(main):001:0> 126844607728-126844601886
=> 5842
NeutralComments
JobName
Goal
Machine Cetus
AllocationStatus
Path
Commit
Toolchain
Script
Submission Submitted build spate-2014-10-27-00-11-16 (65ee155cf79af5eeba962f147c72bb728fcc2ae3) 352130
MachineUtilization
ComputationLoad
RunningTime
MemoryUtilization
Checksum
GoodComments
BadComments
synchronized_graph_stores 15360/15360 126844605303/126844607728
graph store synchronization: actual_kmer_count 126844605303 total_kmer_count 126844607728
got mismatch, will try again
irb(main):002:0> 126844607728 - 126844605303
=> 2425
NeutralComments
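The missing-kmer arithmetic in the reports above can be automated. The following is a minimal Python sketch, assuming the synchronization line keeps the format shown in these outputs (`graph store synchronization: actual_kmer_count N total_kmer_count M`). The name `missing_kmers` is hypothetical and is not part of biosal.

```python
import re

# Pattern for the synchronization line as it appears in the outputs above.
SYNC_RE = re.compile(
    r"graph store synchronization: "
    r"actual_kmer_count (\d+) total_kmer_count (\d+)"
)


def missing_kmers(log_text):
    """Return a list of (actual, total, missing) for each synchronization line."""
    results = []
    for match in SYNC_RE.finditer(log_text):
        actual, total = int(match.group(1)), int(match.group(2))
        results.append((actual, total, total - actual))
    return results


if __name__ == "__main__":
    # Sample lines copied from the two failed Cetus runs above.
    sample = (
        "graph store synchronization: actual_kmer_count 126844601886 "
        "total_kmer_count 126844607728\n"
        "graph store synchronization: actual_kmer_count 126844605303 "
        "total_kmer_count 126844607728\n"
    )
    for actual, total, missing in missing_kmers(sample):
        print(f"{missing} kmers missing out of {total}")
```

For the two sample lines this reproduces the irb results (5842 and 2425). Because the pattern is not anchored to line starts, it should also work on output where several log lines were joined together.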
Last known working code on Cetus #747 540b436
JobName
Goal Test whether a bug in the multiplexer can explain why 2000 kmers are missing.
Machine
AllocationStatus
Path
Commit 65ee155cf79af5eeba962f147c72bb728fcc2ae3 with a timeout of 0 for the multiplexer
Toolchain
Script
Submission
[boisvert@cetuslac1 biosal-tests]$ ./soil-779-1024x16-1.sh
352442
MachineUtilization
ComputationLoad
RunningTime
[boisvert@cetuslac1 biosal-tests]$ grep TIMER soil-779-1024x16-1.output
TIMER [Load input / Count input data] 51.858780 seconds
TIMER [Load input / Distribute input data] 1 minutes, 45.178535 seconds
TIMER [Load input] 2 minutes, 37.037308 seconds
TIMER [Build assembly graph / Distribute vertices] 11 minutes, 31.577698 seconds
MemoryUtilization
Checksum
[boisvert@cetuslac1 biosal-tests]$ sha1sum soil-779-1024x16-1/coverage_distribution.txt-canonical
01a293db48518190038eaddbaed8a47ca0323fc7 soil-779-1024x16-1/coverage_distribution.txt-canonical
GoodComments
BadComments
graph store synchronization: actual_kmer_count 126792901245 total_kmer_count 126844607728
got mismatch, will try again
synchronized_graph_stores 15132/15360 124970090910/126844607728
and then it seems that the coverage distribution was spawned...
biosal_assembly_graph_store/1037883 will use coverage distribution 1071100
(?)
NeutralComments
According to ALCF support:
"The stdout and stderr are channeled through the runjob process and we have seen some rare occurrences when the load is high that data might be dropped."
JobName
Goal Test without the multiplexer
Machine Cetus
AllocationStatus
Path
Commit
Toolchain
Script
[boisvert@cetuslac1 automated-tests]$ cat spate-2014-10-27-17-08-34.sh
#!/bin/bash
# echo "Commit= b3592bc026c74ef95afd35fa3fb2216225c38df3"
qsub \
--env PAMID_THREAD_MULTIPLE=1 \
-A CompBIO \
-n 1024 \
-t 01:00:00 \
-O spate-2014-10-27-17-08-34 \
--mode c1 \
spate -print-load -threads-per-node 16 \
-k 43 Iowa_Continuous_Corn/*.fastq \
-o spate-2014-10-27-17-08-34
Submission Submitted build spate-2014-10-27-17-08-34 (b3592bc026c74ef95afd35fa3fb2216225c38df3) 352554
MachineUtilization
ComputationLoad
RunningTime
[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-10-27-17-08-34.output
TIMER [Load input / Count input data] 41.159801 seconds
TIMER [Load input / Distribute input data] 1 minutes, 47.970085 seconds
TIMER [Load input] 2 minutes, 29.129898 seconds
TIMER [Build assembly graph / Distribute vertices] 11 minutes, 21.300964 seconds
TIMER [Build assembly graph / Distribute arcs] 21 minutes, 51.119995 seconds
TIMER [Build assembly graph] 33 minutes, 12.420898 seconds
MemoryUtilization
Checksum
GoodComments
BadComments
NeutralComments
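To compare TIMER values between runs, it may help to convert them to plain seconds. This is a rough Python sketch, assuming only the two duration formats seen in this issue ("N minutes, S seconds" and "S seconds"); durations with hours are not handled because their format does not appear here. The name `timers_in_seconds` is hypothetical.

```python
import re

# Matches e.g. "TIMER [Load input] 2 minutes, 29.129898 seconds"
# and "TIMER [Load input / Count input data] 41.159801 seconds".
TIMER_RE = re.compile(
    r"TIMER \[([^\]]+)\] (?:(\d+) minutes, )?(\d+(?:\.\d+)?) seconds"
)


def timers_in_seconds(text):
    """Map each TIMER label found in text to its duration in seconds."""
    durations = {}
    for label, minutes, seconds in TIMER_RE.findall(text):
        # The minutes group is an empty string when the line has seconds only.
        durations[label] = int(minutes or 0) * 60 + float(seconds)
    return durations


if __name__ == "__main__":
    # Lines copied from the Cetus run without the multiplexer.
    sample = (
        "TIMER [Load input / Count input data] 41.159801 seconds\n"
        "TIMER [Build assembly graph / Distribute arcs] 21 minutes, 51.119995 seconds\n"
        "TIMER [Build assembly graph] 33 minutes, 12.420898 seconds\n"
    )
    for label, value in timers_in_seconds(sample).items():
        print(f"{label}: {value:.3f} s")
```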
On Beagle:
5632 graph stores
Beagle) grep "graph_Store receives " spate-2014-10-26-10-38-13.stdout|wc -l
11264
irb(main):001:0> 5632 * 2
=> 11264
So all graph stores received the message.
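The check used on both machines (the count of "graph_Store receives " lines against the number of graph stores) can be written as a small Python sketch. It assumes, based on the Beagle result, that a complete run logs the line twice per graph store; the function name and the `expected_per_store` parameter are hypothetical.

```python
def message_shortfall(graph_stores, observed_lines, expected_per_store=2):
    """Return how many 'graph_Store receives ' lines are missing.

    expected_per_store=2 is an assumption taken from the Beagle run,
    where 5632 graph stores produced 11264 lines.
    """
    return graph_stores * expected_per_store - observed_lines


if __name__ == "__main__":
    # Numbers copied from the grep | wc -l outputs in this issue.
    print("Beagle shortfall:", message_shortfall(5632, 11264))
    print("Cetus shortfall:", message_shortfall(15360, 29271))
```

With the numbers from this issue, the Beagle shortfall is 0 and the Cetus shortfall is 1449, which equals 15360 - 13911, the number of graph stores not reported as synchronized. Since these counts come from stdout, they may undercount if output was dropped under load, as ALCF support describes.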