GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License

bug: synchronization issue for vertices on cetus and beagle #779

Closed (sebhtml closed this 9 years ago)

sebhtml commented 9 years ago

On Beagle:

5632 graph stores
Beagle) grep "graph_Store receives " spate-2014-10-26-10-38-13.stdout|wc -l
11264
irb(main):001:0> 5632 * 2
=> 11264

The count is exactly two receipts per graph store, so all graph stores received the message.

sebhtml commented 9 years ago

On Cetus, it seems that not all graph stores received the message:

[boisvert@cetuslac1 automated-tests]$ grep complexity spate-2014-10-26-03-59-27.output
debug biosal_assembly_graph_builder_control_complexity 15360 graph stores
debug biosal_assembly_graph_builder_control_complexity 15360 graph stores

[boisvert@cetuslac1 automated-tests]$ grep "graph_Store receives " spate-2014-10-26-03-59-27.output|wc -l
29271

[boisvert@cetuslac1 automated-tests]$ grep actual spate-2014-10-26-03-59-27.output
graph store synchronization: actual_kmer_count 102319934766 total_kmer_count 126844607728
[boisvert@cetuslac1 automated-tests]$ grep synchronized spate-2014-10-26-03-59-27.output|tail -n 1
synchronized_graph_stores 13911/15360

irb(main):001:0> 15360+13911
=> 29271
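
The counts from the two machines can be put side by side. The sketch below assumes that each graph store is expected to log two "graph_Store receives " lines, which is what the Beagle count (5632 * 2 = 11264) suggests; the function name `missing_receipts` is hypothetical and is not part of biosal.

```python
def missing_receipts(store_count, observed_receipts, receipts_per_store=2):
    """Return how many expected receipt log lines are absent.

    receipts_per_store=2 is an assumption inferred from the Beagle run,
    not a value documented by biosal.
    """
    expected = store_count * receipts_per_store
    return expected - observed_receipts


# Counts taken from the grep | wc -l results above.
beagle_missing = missing_receipts(5632, 11264)
cetus_missing = missing_receipts(15360, 29271)

print("Beagle missing receipts:", beagle_missing)
print("Cetus missing receipts:", cetus_missing)
```

Under that assumption, the Cetus shortfall equals 15360 - 13911, the number of graph stores not yet reported as synchronized.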

sebhtml commented 9 years ago

JobName:
Goal:
Machine: Beagle

AllocationStatus:
Path:
Commit:
Toolchain:
Script:
Submission: Submitted build spate-2014-10-26-15-44-30 (1c94a2b173b4c4656cfc1f22dbc8a0cfb287b7af) 2889860.sdb

MachineUtilization:
ComputationLoad:
RunningTime:
Beagle) grep TIMER spate-2014-10-26-15-44-30.stdout
TIMER [Load input / Count input data] 3 minutes, 37.580170 seconds
TIMER [Load input / Distribute input data] 3 minutes, 13.564377 seconds
TIMER [Load input] 6 minutes, 51.144531 seconds
TIMER [Build assembly graph / Distribute vertices] 3 minutes, 55.399902 secondscore_manager/1020925 dies
TIMER [Build assembly graph / Distribute arcs] 7 minutes, 11.654602 seconds
TIMER [Build assembly graph] 11 minutes, 7.054504 seconds

MemoryUtilization:
Checksum:
Beagle) sha1sum spate-2014-10-26-15-44-30/coverage_distribution.txt-canonical
01a293db48518190038eaddbaed8a47ca0323fc7 spate-2014-10-26-15-44-30/coverage_distribution.txt-canonical

GoodComments:
BadComments:
NeutralComments:
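
The Checksum field above records the SHA-1 of coverage_distribution.txt-canonical for a run that completed. As a minimal sketch, a later run could be compared against that value without reading any log output. The helper names (`sha1_of_chunks`, `sha1_of_file`, `matches_reference`) are hypothetical; only `hashlib` from the Python standard library is used.

```python
import hashlib

# SHA-1 reported by sha1sum for the Beagle run above.
REFERENCE_SHA1 = "01a293db48518190038eaddbaed8a47ca0323fc7"


def sha1_of_chunks(chunks):
    """Hash an iterable of bytes objects and return the hex digest."""
    digest = hashlib.sha1()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large outputs are not loaded at once."""
    with open(path, "rb") as handle:
        return sha1_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def matches_reference(path, reference=REFERENCE_SHA1):
    """Return True when the file's SHA-1 equals the reference digest."""
    return sha1_of_file(path) == reference.lower()


# Example call (the path is whatever output directory the run used):
# matches_reference("spate-2014-10-26-15-44-30/coverage_distribution.txt-canonical")
```

This performs the same comparison as running sha1sum by hand and reading the two digests.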

sebhtml commented 9 years ago

JobName:
Goal:
Machine: Cetus

AllocationStatus:
Path:
Commit:
Toolchain:
Script:
Submission: Submitted build spate-2014-10-26-20-44-36 (1c94a2b173b4c4656cfc1f22dbc8a0cfb287b7af) 352071

MachineUtilization:
ComputationLoad:
RunningTime: FAILED

MemoryUtilization:
Checksum:
GoodComments:
BadComments:
synchronized_graph_stores 15360/15360 126844601886/126844607728
graph store synchronization: actual_kmer_count 126844601886 total_kmer_count 126844607728 got mismatch, will try again

irb(main):001:0> 126844607728-126844601886
=> 5842

NeutralComments:

sebhtml commented 9 years ago

JobName:
Goal:
Machine: Cetus

AllocationStatus:
Path:
Commit:
Toolchain:
Script:
Submission: Submitted build spate-2014-10-27-00-11-16 (65ee155cf79af5eeba962f147c72bb728fcc2ae3) 352130

MachineUtilization:
ComputationLoad:
RunningTime:
MemoryUtilization:
Checksum:
GoodComments:
BadComments:
synchronized_graph_stores 15360/15360 126844605303/126844607728
graph store synchronization: actual_kmer_count 126844605303 total_kmer_count 126844607728 got mismatch, will try again

irb(main):002:0> 126844607728 - 126844605303
=> 2425

NeutralComments:
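
The subtraction done by hand in irb for the two Cetus runs can also be read straight from the log line. The following is a sketch that assumes the "graph store synchronization" line always has the form quoted in these reports; `kmer_deficit` is a hypothetical helper, not a biosal function.

```python
import re

# Matches the two counters as they appear in the quoted log lines.
SYNC_PATTERN = re.compile(r"actual_kmer_count (\d+) total_kmer_count (\d+)")


def kmer_deficit(line):
    """Return total_kmer_count - actual_kmer_count, or None if the line does not match."""
    match = SYNC_PATTERN.search(line)
    if match is None:
        return None
    actual, total = (int(group) for group in match.groups())
    return total - actual


# Log lines quoted from the two Cetus reports above.
run_1 = ("graph store synchronization: actual_kmer_count 126844601886 "
         "total_kmer_count 126844607728 got mismatch, will try again")
run_2 = ("graph store synchronization: actual_kmer_count 126844605303 "
         "total_kmer_count 126844607728 got mismatch, will try again")

print(kmer_deficit(run_1))  # 5842
print(kmer_deficit(run_2))  # 2425
```

The deficit differs between the two runs (5842 and 2425) even though both report 15360/15360 synchronized graph stores.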

sebhtml commented 9 years ago

Last known working code on Cetus: #747 (540b436)

sebhtml commented 9 years ago

JobName:
Goal: Test whether a bug in the multiplexer can explain why about 2000 kmers are missing.

Machine:
AllocationStatus:
Path:
Commit: 65ee155cf79af5eeba962f147c72bb728fcc2ae3 with a timeout of 0 for the multiplexer

Toolchain:
Script:
Submission:
[boisvert@cetuslac1 biosal-tests]$ ./soil-779-1024x16-1.sh
352442

MachineUtilization:
ComputationLoad:
RunningTime:
[boisvert@cetuslac1 biosal-tests]$ grep TIMER soil-779-1024x16-1.output
TIMER [Load input / Count input data] 51.858780 seconds
TIMER [Load input / Distribute input data] 1 minutes, 45.178535 seconds
TIMER [Load input] 2 minutes, 37.037308 seconds
TIMER [Build assembly graph / Distribute vertices] 11 minutes, 31.577698 seconds

MemoryUtilization:
Checksum:
[boisvert@cetuslac1 biosal-tests]$ sha1sum soil-779-1024x16-1/coverage_distribution.txt-canonical
01a293db48518190038eaddbaed8a47ca0323fc7 soil-779-1024x16-1/coverage_distribution.txt-canonical

GoodComments:
BadComments:
graph store synchronization: actual_kmer_count 126792901245 total_kmer_count 126844607728 got mismatch, will try again

synchronized_graph_stores 15132/15360 124970090910/126844607728

And then it seems that the coverage distribution was spawned...

biosal_assembly_graph_store/1037883 will use coverage distribution 1071100

(?)

NeutralComments:

According to ALCF support:

"The stdout and stderr are channeled through the runjob process and we have seen some rare occurrences when the load is high that data might be dropped."

sebhtml commented 9 years ago

.

sebhtml commented 9 years ago

JobName:
Goal: Test without the multiplexer

Machine: Cetus

AllocationStatus:
Path:
Commit:
Toolchain:
Script:

[boisvert@cetuslac1 automated-tests]$ cat spate-2014-10-27-17-08-34.sh 
#!/bin/bash

# echo "Commit= b3592bc026c74ef95afd35fa3fb2216225c38df3"

qsub \
 --env PAMID_THREAD_MULTIPLE=1 \
 -A CompBIO \
 -n 1024 \
 -t 01:00:00 \
 -O spate-2014-10-27-17-08-34 \
 --mode c1 \
     spate -print-load -threads-per-node 16 \
    -k 43 Iowa_Continuous_Corn/*.fastq \
    -o spate-2014-10-27-17-08-34

Submission: Submitted build spate-2014-10-27-17-08-34 (b3592bc026c74ef95afd35fa3fb2216225c38df3) 352554

MachineUtilization:
ComputationLoad:
RunningTime:
[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-10-27-17-08-34.output
TIMER [Load input / Count input data] 41.159801 seconds
TIMER [Load input / Distribute input data] 1 minutes, 47.970085 seconds
TIMER [Load input] 2 minutes, 29.129898 seconds
TIMER [Build assembly graph / Distribute vertices] 11 minutes, 21.300964 seconds
TIMER [Build assembly graph / Distribute arcs] 21 minutes, 51.119995 seconds
TIMER [Build assembly graph] 33 minutes, 12.420898 seconds

MemoryUtilization:
Checksum:
GoodComments:
BadComments:
NeutralComments:
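
To compare the TIMER lines across runs, the durations can be converted to seconds. This sketch handles only the two forms that appear in the reports in this thread ("S seconds" and "M minutes, S seconds"); other forms, such as hours, are not handled. `parse_timers` is a hypothetical helper name.

```python
import re

# Label in brackets, optional "N minutes, " prefix, then fractional seconds.
TIMER_PATTERN = re.compile(
    r"TIMER \[(?P<label>[^\]]+)\] "
    r"(?:(?P<minutes>\d+) minutes, )?(?P<seconds>\d+\.\d+) seconds"
)


def parse_timers(text):
    """Map each TIMER label found in text to its duration in seconds."""
    timers = {}
    for match in TIMER_PATTERN.finditer(text):
        minutes = int(match.group("minutes") or 0)
        timers[match.group("label")] = minutes * 60 + float(match.group("seconds"))
    return timers


# Three of the TIMER lines from the Cetus run without the multiplexer.
cetus_output = (
    "TIMER [Load input / Count input data] 41.159801 seconds\n"
    "TIMER [Load input / Distribute input data] 1 minutes, 47.970085 seconds\n"
    "TIMER [Build assembly graph] 33 minutes, 12.420898 seconds\n"
)

for label, seconds in parse_timers(cetus_output).items():
    print(f"{label}: {seconds:.1f} s")
```

Because the pattern is applied with `finditer` over the whole text, it also tolerates TIMER entries that share a line with other output.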