Open sebhtml opened 10 years ago
JobName soil-690-2048x16-1
Machine Mira
AllocationStatus CompBIO= 820040.0 core-hours
Path /projects/CompBIO/Projects/biosal-tests
Commit [boisvert@miralac1 biosal]$ git log | head -n1 commit 7ccbe7051ff2fc29215dfab7bcad878d44d4e0c3
Toolchain /soft/compilers/ibmcmp-may2014/vac/bg/12.1/bin/.orig/bgxlc_r:
Script [boisvert@miralac1 biosal-tests]$ cat soil-690-2048x16-1.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 2048 \ -t 03:00:00 \ -O soil-690-2048x16-1 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-2048x16-1
Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-2048x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 332171
MachineUtilization not important
ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-2048x16-1.output TIMER [Load input / Count input data] 37.063904 seconds TIMER [Load input / Distribute input data] 2 minutes, 53.165146 seconds TIMER [Load input] 3 minutes, 30.229034 seconds TIMER [Build assembly graph / Distribute vertices] 7 minutes, 24.430878 seconds TIMER [Build assembly graph / Distribute arcs] 8 minutes, 38.779114 seconds TIMER [Build assembly graph] 16 minutes, 3.209961 seconds
MemoryUtilization Checksum GoodComments BadComments NeutralComments
JobName soil-690-4096x16-1
Machine Mira
AllocationStatus CompBIO= 820040.0 core-hours
Path /projects/CompBIO/Projects/biosal-tests
Commit 7ccbe7051ff2fc29215dfab7bcad878d44d4e0c3
Toolchain /soft/compilers/ibmcmp-may2014/vac/bg/12.1/bin/.orig/bgxlc_r:
Script [boisvert@miralac1 biosal-tests]$ cat ./soil-690-4096x16-1.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 4096 \ -t 03:00:00 \ -O soil-690-4096x16-1 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-4096x16-1
Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-4096x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 332172
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-4096x16-1.output TIMER [Load input / Count input data] 47.448002 seconds TIMER [Load input / Distribute input data] 4 minutes, 25.178558 seconds TIMER [Load input] 5 minutes, 12.626556 seconds TIMER [Build assembly graph / Distribute vertices] 7 minutes, 13.168182 seconds
MemoryUtilization Checksum GoodComments 4096 nodes 4096 MPI ranks 65536 cores = 4096 Thorium nodes + 61440 Thorium workers around 1000 actors per worker (around 60 M actors in total)
BadComments NeutralComments
JobName Machine AllocationStatus Path Commit dc5663feba4c3b87b057b0b490a997698ff4679e
Toolchain Script [boisvert@miralac1 biosal-tests]$ cat soil-690-1024x16-1.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 03:00:00 \ -O soil-690-1024x16-1 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-1024x16-1
Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333531
MachineUtilization ComputationLoad RunningTime I fixed a bug. The multiplexer was leaking memory...
MemoryUtilization Checksum GoodComments BadComments NeutralComments
JobName Machine AllocationStatus Path Commit dc5663feba4c3b87b057b0b490a997698ff4679e updated->10149b6048e1dfc1e3b813066218bcec038059db
Toolchain Script [boisvert@miralac1 biosal-tests]$ cat soil-690-2048x16-2.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 2048 \ -t 03:00:00 \ -O soil-690-2048x16-2 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-2048x16-2
Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-2048x16-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333533
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-2048x16-2.output TIMER [Load input / Count input data] 46.997513 seconds TIMER [Load input / Distribute input data] 2 minutes, 51.644424 seconds TIMER [Load input] 3 minutes, 38.641937 seconds TIMER [Build assembly graph / Distribute vertices] 7 minutes, 54.092896 seconds TIMER [Build assembly graph / Distribute arcs] 8 minutes, 26.360657 seconds TIMER [Build assembly graph] 16 minutes, 20.453552 seconds
MemoryUtilization pool systems are buggy:
[boisvert@miralac1 biosal-tests]$ grep POOL soil-690-2048x16-2.output DEBUG_POOL Name= 2 SinceSnapshotActiveSegments= 1351106 (9233967 - 7882861) ActiveSegments= 1351106 (9233967 - 7882861) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 46598495616 (46598495616 - 0) DEBUG_POOL Name= 3 SinceSnapshotActiveSegments= 0 (3151778 - 3151778) ActiveSegments= 0 (3151778 - 3151778) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 5307321344 (5307321344 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7593700 (57614537 - 65208237) SnapshotActiveSegments= -7593700 (57614537 - 65208237) AllocatedBytes= 3077991821 (3077991821 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 708 (136372 - 135664) ActiveSegments= 708 (136372 - 135664) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 624469120 (624469120 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -8353736 (53753412 - 62107148) SnapshotActiveSegments= -8353736 (53753412 - 62107148) AllocatedBytes= 3029552645 (3029552645 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (135319 - 135318) ActiveSegments= 1 (135319 - 135318) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 588121600 (588121600 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7209466 (66284628 - 73494094) SnapshotActiveSegments= -7209466 (66284628 - 73494094) AllocatedBytes= 3683357471 (3683357471 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (135444 - 135443) ActiveSegments= 1 (135444 - 135443) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 721218560 (721218560 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -8820644 (65426505 - 74247149) SnapshotActiveSegments= -8820644 (65426505 - 74247149) AllocatedBytes= 3500035355 (3500035355 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (159915 - 159914) ActiveSegments= 1 (159915 - 159914) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 734842112 (734842112 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -8816274 (69578505 - 78394779) SnapshotActiveSegments= -8816274 (69578505 - 78394779) AllocatedBytes= 3813768552 (3813768552 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (159920 - 159919) ActiveSegments= 1 (159920 - 159919) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 767920896 (767920896 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7504973 (56204768 - 63709741) SnapshotActiveSegments= -7504973 (56204768 - 63709741) AllocatedBytes= 3024674231 (3024674231 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (159911 - 159910) ActiveSegments= 1 (159911 - 159910) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 604280576 (604280576 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -9953168 (111976479 - 121929647) SnapshotActiveSegments= -9953168 (111976479 - 121929647) AllocatedBytes= 4151413769 (4151413769 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (525350 - 525349) ActiveSegments= 1 (525350 - 525349) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 713435264 (713435264 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7213108 (104258815 - 111471923) SnapshotActiveSegments= -7213108 (104258815 - 111471923) AllocatedBytes= 3623808184 (3623808184 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (500768 - 500767) ActiveSegments= 1 (500768 - 500767) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 616991360 (616991360 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -6072074 (98500885 - 104572959) SnapshotActiveSegments= -6072074 (98500885 - 104572959) AllocatedBytes= 3544720318 (3544720318 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (463898 - 463897) ActiveSegments= 1 (463898 - 463897) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 542231168 (542231168 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7158160 (104660737 - 111818897) SnapshotActiveSegments= -7158160 (104660737 - 111818897) AllocatedBytes= 3613930975 (3613930975 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 2029 (2295392 - 2293363) ActiveSegments= 2029 (2295392 - 2293363) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 834120448 (834120448 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -6086028 (104815891 - 110901919) SnapshotActiveSegments= -6086028 (104815891 - 110901919) AllocatedBytes= 3753892125 (3753892125 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (476190 - 476189) ActiveSegments= 1 (476190 - 476189) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 614874624 (614874624 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7685116 (112027476 - 119712592) SnapshotActiveSegments= -7685116 (112027476 - 119712592) AllocatedBytes= 4088439708 (4088439708 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (500758 - 500757) ActiveSegments= 1 (500758 - 500757) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 694589440 (694589440 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7879013 (55920040 - 63799053) SnapshotActiveSegments= -7879013 (55920040 - 63799053) AllocatedBytes= 3120891696 (3120891696 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (147619 - 147618) ActiveSegments= 1 (147619 - 147618) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 636470912 (636470912 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -8341234 (62276421 - 70617655) SnapshotActiveSegments= -8341234 (62276421 - 70617655) AllocatedBytes= 3383011803 (3383011803 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (147617 - 147616) ActiveSegments= 1 (147617 - 147616) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 679147264 (679147264 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7403626 (61988581 - 69392207) SnapshotActiveSegments= -7403626 (61988581 - 69392207) AllocatedBytes= 3193199716 (3193199716 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (159896 - 159895) ActiveSegments= 1 (159896 - 159895) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 667083904 (667083904 - 0)
Checksum GoodComments BadComments NeutralComments
JobName Machine AllocationStatus Path Commit dc5663f updated->10149b6048e1dfc1e3b813066218bcec038059db
Toolchain Script [boisvert@miralac1 biosal-tests]$ cat soil-690-4096x16-2.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 4096 \ -t 03:00:00 \ -O soil-690-4096x16-2 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-4096x16-2
Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-4096x16-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333535
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-4096x16-2.output TIMER [Load input / Count input data] 51.708893 seconds TIMER [Load input / Distribute input data] 4 minutes, 28.937408 seconds TIMER [Load input] 5 minutes, 20.646301 seconds TIMER [Build assembly graph / Distribute vertices] 10 minutes, 25.557434 seconds TIMER [Build assembly graph / Distribute arcs] 20 minutes, 57.344849 seconds TIMER [Build assembly graph] 31 minutes, 22.902344 seconds
MemoryUtilization Checksum GoodComments BadComments NeutralComments
JobName Machine AllocationStatus Path Commit dc5663f
Toolchain Script [boisvert@miralac1 biosal-tests]$ cat soil-690-512x16-1.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 512 \ -t 03:00:00 \ -O soil-690-512x16-1 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-512x16-1
Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333537
MachineUtilization ComputationLoad RunningTime out of memory, I now solved the issue.
MemoryUtilization Checksum GoodComments BadComments NeutralComments
JobName Machine AllocationStatus Path Commit 10149b6048e1dfc1e3b813066218bcec038059db
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333623
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-1024x16-2.output TIMER [Load input / Count input data] 52.247120 seconds TIMER [Load input / Distribute input data] 1 minutes, 47.248177 seconds TIMER [Load input] 2 minutes, 39.495300 seconds TIMER [Build assembly graph / Distribute vertices] 10 minutes, 10.982605 seconds TIMER [Build assembly graph / Distribute arcs] 12 minutes, 52.842712 seconds TIMER [Build assembly graph] 23 minutes, 3.825317 seconds
MemoryUtilization Checksum [boisvert@miralac1 biosal-tests]$ sha1sum soil-690-1024x16-2/coverage_distribution.txt-canonical 01a293db48518190038eaddbaed8a47ca0323fc7 soil-690-1024x16-2/coverage_distribution.txt-canonical
GoodComments BadComments NeutralComments
JobName Machine AllocationStatus Path Commit 10149b6048e1dfc1e3b813066218bcec038059db
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333622
MachineUtilization ComputationLoad RunningTime MemoryUtilization [boisvert@miralac1 biosal-tests]$ grep Error soil-690-512x16-2.output DEBUG Error bsal_memory_allocate returned (nil), 1040187392 bytes
Checksum GoodComments BadComments NeutralComments
mira is in maintenance.
JobName Machine AllocationStatus Path Commit 96a356661370cbdd14961fa5f8b16f8f6949d18b
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 334227
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-512x16-3.output TIMER [Load input / Count input data] 1 minutes, 5.614647 seconds TIMER [Load input / Distribute input data] 1 minutes, 32.000870 seconds TIMER [Load input] 2 minutes, 37.615524 seconds
MemoryUtilization [boisvert@miralac1 biosal-tests]$ grep Error soil-690-512x16-3.output DEBUG Error bsal_memory_allocate returned (nil), 1040187392 bytes ActiveSegments= 127326
Checksum GoodComments BadComments NeutralComments
JobName Machine AllocationStatus Path Commit 96a356661370cbdd14961fa5f8b16f8f6949d18b
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 334228
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-1024x16-3.output TIMER [Load input / Count input data] 43.734001 seconds TIMER [Load input / Distribute input data] 1 minutes, 57.532341 seconds TIMER [Load input] 2 minutes, 41.266342 seconds TIMER [Build assembly graph / Distribute vertices] 10 minutes, 4.200684 seconds TIMER [Build assembly graph / Distribute arcs] 12 minutes, 47.443115 seconds TIMER [Build assembly graph] 22 minutes, 51.643799 seconds
MemoryUtilization [boisvert@miralac1 biosal-tests]$ grep Error soil-690-1024x16-3.output DEBUG Error bsal_memory_allocate returned (nil), 8192 bytes used / total -> 17090740224 / 17163091968
Checksum GoodComments BadComments NeutralComments
JobName
Machine
AllocationStatus
Path
Commit
96a356661370cbdd14961fa5f8b16f8f6949d18b
84bb626d22252a5d4603647e0255d1c9876f97b7
5e0b35619e7b3f663e5663896d9e761004135954
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-2048x16-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 334229
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-2048x16-3.output TIMER [Load input / Count input data] 37.634472 seconds TIMER [Load input / Distribute input data] 2 minutes, 53.514832 seconds TIMER [Load input] 3 minutes, 31.149307 seconds TIMER [Build assembly graph / Distribute vertices] 6 minutes, 47.528839 seconds
MemoryUtilization Checksum GoodComments BadComments out of memory
NeutralComments
JobName
Machine
AllocationStatus
Path
Commit
96a356661370cbdd14961fa5f8b16f8f6949d18b
84bb626d22252a5d4603647e0255d1c9876f97b7
5e0b35619e7b3f663e5663896d9e761004135954
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-4096x16-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 334231
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-4096x16-3.output TIMER [Load input / Count input data] 1 minutes, 3.382736 seconds TIMER [Load input / Distribute input data] 4 minutes, 24.529755 seconds TIMER [Load input] 5 minutes, 27.912506 seconds TIMER [Build assembly graph / Distribute vertices] 97 minutes, 42.688477 seconds
MemoryUtilization Checksum GoodComments BadComments out of memory
NeutralComments
334231 soil-690-4096x16-3 boisvert 916.3 04:00:00 103:15:14 N/A 4096 queued None c1 4096 prod-short N/A
334229 soil-690-2048x16-3 boisvert 1.1 K 03:00:00 103:15:59 N/A 2048 queued None c1 2048 prod-short N/A
JobName
Machine
AllocationStatus
Path
Commit
84bb626d22252a5d4603647e0255d1c9876f97b7
5e0b35619e7b3f663e5663896d9e761004135954
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-4.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 336506
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-512x16-4.output TIMER [Load input / Count input data] 1 minutes, 2.588917 seconds TIMER [Load input / Distribute input data] 1 minutes, 38.910957 seconds TIMER [Load input] 2 minutes, 41.499878 seconds
MemoryUtilization Checksum GoodComments BadComments out of memory
NeutralComments
JobName
Machine
AllocationStatus
Path
Commit
84bb626d22252a5d4603647e0255d1c9876f97b7
5e0b35619e7b3f663e5663896d9e761004135954
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-4.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 336507
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-1024x16-4.output TIMER [Load input / Count input data] 38.235104 seconds TIMER [Load input / Distribute input data] 1 minutes, 46.555290 seconds TIMER [Load input] 2 minutes, 24.790390 seconds
MemoryUtilization out of memory
Checksum GoodComments BadComments NeutralComments
JobName
Machine
AllocationStatus
Path
Commit
04fe9b871d9fc480a90696f1bf12073c18a6abf3
9a2fd78d2ffcc350bdc4e1b653208c35f44989ac
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-5.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 337447
MachineUtilization ComputationLoad RunningTime cancelled
MemoryUtilization Checksum GoodComments BadComments NeutralComments
JobName
Machine
AllocationStatus
Path
Commit
04fe9b871d9fc480a90696f1bf12073c18a6abf3
9a2fd78d2ffcc350bdc4e1b653208c35f44989ac
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-5.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 337448
MachineUtilization ComputationLoad RunningTime SIGSEGV because of memory usage
MemoryUtilization Checksum GoodComments BadComments NeutralComments
JobName
Machine
AllocationStatus
Path
Commit
04fe9b871d9fc480a90696f1bf12073c18a6abf3
9a2fd78d2ffcc350bdc4e1b653208c35f44989ac
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-2048x16-4.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 337449
MachineUtilization ComputationLoad RunningTime cancelled
MemoryUtilization Checksum GoodComments BadComments NeutralComments
JobName
Machine
AllocationStatus
Path
Commit
04fe9b871d9fc480a90696f1bf12073c18a6abf3
9a2fd78d2ffcc350bdc4e1b653208c35f44989ac
Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-4096x16-4.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 337450
MachineUtilization ComputationLoad RunningTime cancelled
MemoryUtilization Checksum GoodComments BadComments NeutralComments
will wait for new allocation to kick in
dependencies #709 #723
JobName Goal Evaluate Spate on Blue Gene/Q since Beagle will be down until December 2014
Machine AllocationStatus Path Commit 4908fd9090bd79d9373021eb9b746c65bfb414bc
Toolchain Script Submission Submitted build spate-2014-11-10-18-34-23 (4908fd9090bd79d9373021eb9b746c65bfb414bc) 361599
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-10-18-34-23.output TIMER [Load input / Count input data] 1 minutes, 1.136402 seconds TIMER [Load input / Distribute input data] 2 minutes, 44.771774 seconds TIMER [Load input] 3 minutes, 45.908173 seconds TIMER [Build assembly graph / Distribute vertices] core_manager/1055741 dies [boisvert@miralac1 automated-tests]$ grep minutes spate-2014-11-10-18-34-23.output TIMER [Load input / Count input data] 1 minutes, 1.136402 seconds TIMER [Load input / Distribute input data] 2 minutes, 44.771774 seconds TIMER [Load input] 3 minutes, 45.908173 seconds 28 minutes, 48.453003 seconds
MemoryUtilization Checksum GoodComments BadComments NeutralComments
Last good:
spate-2014-10-27-17-08-34 (b3592bc) https://github.com/GeneAssembly/biosal/issues/775#issuecomment-60637238 the multiplexer was disabled.
JobName Goal Verify if the slow-down is caused by the multiplexer. If it is the case, ticket #805 should fix it. In this run, assertions were turned on
Machine AllocationStatus Path Commit ed8f3d2e4d82df226dd2a3dc1d998135acceb1dc
Toolchain Script [boisvert@miralac1 automated-tests]$ cat spate-2014-11-11-20-05-40.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-11-20-05-40 \ --mode c1 \ spate-2014-11-11-20-05-40.spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-11-20-05-40 -disable-multiplexer
Submission [boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh
(added -disable-multiplexer)
[boisvert@miralac1 automated-tests]$ ./spate-2014-11-11-20-05-40.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 362328
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-11-20-05-40.output TIMER [Load input / Count input data] 40.701588 seconds TIMER [Load input / Distribute input data] 1 minutes, 43.710915 seconds TIMER [Load input] 2 minutes, 24.412506 seconds TIMER [Build assembly graph / Distribute vertices] 22 minutes, 21.978760 seconds TIMER [Build assembly graph / Distribute arcs] 28 minutes, 13.162842 seconds TIMER [Build assembly graph] 50 minutes, 35.141602 seconds
MemoryUtilization Checksum GoodComments BadComments NeutralComments
Goal Test without assertions
test on Mira with new build script (CONFIG_DEBUG=n) since DEBUG=y generates much slower code.
[boisvert@miralac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh
Submitted build spate-2014-11-12-04-00-21 (211d4e1893bc70617764ba534068e3d2f196dbdd) 362582
[boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-12-04-00-21.output TIMER [Load input / Count input data] 42.059803 seconds TIMER [Load input / Distribute input data] 4 minutes, 3.187134 seconds TIMER [Load input] 4 minutes, 45.246918 seconds TIMER [Build assembly graph / Distribute vertices] 14 minutes, 26.619629 seconds TIMER [Build assembly graph / Distribute arcs] 28 minutes, 12.110107 seconds TIMER [Build assembly graph] 42 minutes, 38.729736 seconds
JobName Goal Test without multiplexer and without assertions
Machine Mira
AllocationStatus Path Commit 211d4e1893bc70617764ba534068e3d2f196dbdd
Toolchain Script [boisvert@miralac1 automated-tests]$ cat spate-2014-11-12-15-48-15.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-12-15-48-15 \ --mode c1 \ spate-2014-11-12-15-48-15.spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-12-15-48-15 -disable-multiplexer
Submission [boisvert@miralac1 automated-tests]$ ./spate-2014-11-12-15-48-15.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 362815
MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-12-15-48-15.output TIMER [Load input / Count input data] 40.135349 seconds TIMER [Load input / Distribute input data] 1 minutes, 43.414711 seconds TIMER [Load input] 2 minutes, 23.550064 seconds TIMER [Build assembly graph / Distribute vertices] 11 minutes, 20.589844 seconds TIMER [Build assembly graph / Distribute arcs] 22 minutes, 30.075806 seconds TIMER [Build assembly graph] 33 minutes, 50.665771 seconds
MemoryUtilization thorium_node: node/538 METRICS AliveActorCount: 1476 ByteCount: 14148833280 / 17163091968
Checksum [boisvert@miralac1 automated-tests]$ sha1sum spate-2014-11-12-15-48-15/coverage_distribution.txt-canonical 01a293db48518190038eaddbaed8a47ca0323fc7 spate-2014-11-12-15-48-15/coverage_distribution.txt-canonical
GoodComments -disable-multiplexer won't be required because the multiplexer is now adaptive.
BadComments It ran out of memory because of all these small messages.
NeutralComments
JobName Goal Test adaptive multiplexer algorithm and lower k-mer length
Machine AllocationStatus Path Commit 509472f94bdbb14b13346ee5d3a9573763a2f5f3
Toolchain Script Submission [boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-13-22-09-23 (509472f94bdbb14b13346ee5d3a9573763a2f5f3) 364013
MachineUtilization ComputationLoad RunningTime cancelled, problem with timeout
MemoryUtilization Checksum GoodComments BadComments NeutralComments
JobName Goal Test manual activation of multiplexer
Machine AllocationStatus Path Commit Toolchain Script Submission
[boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh \ Project 'compbio'; job rerouted to queue 'prod-short' Submitted build spate-2014-11-14-02-46-11 (b0ecb745f90d7261a116a6175c7e57beb4b4b680) 364200
MachineUtilization ComputationLoad RunningTime
[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-14-02-46-11.output
TIMER [Load input / Count input data] 39.870667 seconds
TIMER [Load input / Distribute input data] 1 minutes, 42.208771 seconds
TIMER [Load input] 2 minutes, 22.079437 seconds
TIMER [Build assembly graph / Distribute vertices] 13 minutes, 29.460754 seconds
TIMER [Build assembly graph / Distribute arcs] 26 minutes, 22.444458 seconds
TIMER [Build assembly graph] 39 minutes, 51.905273 seconds
MemoryUtilization Ran out of memory here:
DEBUG Error core_memory_allocate returned (nil), 8388608 bytes used / total -> 15699095552 / 17163091968
[boisvert@cetuslac1 automated-tests]$ addr2line -e spate-2014-11-14-02-46-11.spate < spate-2014-11-14-02-46-11.stack /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/tracer.c:42 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory.c:128 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:148 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/structures/vector.c:181 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/helpers/vector_helper.c:351 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/patterns/manager.c:314 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/actor.c:1207 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1909 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1685 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:713 /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322 :0
[boisvert@cetuslac1 automated-tests]$ grep GRAPH spate-2014-11-14-02-46-11.output GRAPH -> 148043434448 vertices, 324996128800 vertex observations, and 131189785122 arcs.
This is 217 bytes per k-mer...
irb(main):002:0> 15699095552 / (148043434448 / 2 / 1024) => 217
biosal_assembly_sliding_window/1051763 kmer length is 27
So there is a lot of memory fragmentation.
after loading input data: thorium_node: node/385 METRICS AliveActorCount: 61 ByteCount: 1937833984 / 17163091968
after vertices extraction: thorium_node: node/60 METRICS AliveActorCount: 61 ByteCount: 13802754048 / 17163091968
thorium_node: node/958 METRICS AliveActorCount: 61 ByteCount: 15634251776 / 17163091968
http://www.idris.fr/eng/turing/turing-fragmentation_memoire-eng.html
Checksum GoodComments BadComments NeutralComments
needs: #819 #820
Goal Check if there is less fragmentation now.
[boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-14-23-39-40 (558c83b2ccf26061031a66a9dafeb41492712931) 364888
Goal Test with normalized allocation sizes.
[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-15-04-07-25 (c4d7a9d0e8f66104b750fbac825532425fa4e39a) 364993
[boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-15-04-07-25.output
TIMER [Load input / Count input data] 41.663147 seconds
TIMER [Load input / Distribute input data] 1 minutes, 42.034645 seconds
TIMER [Load input] 2 minutes, 23.697800 seconds
TIMER [Build assembly graph / Distribute vertices] 12 minutes, 53.424500 secondscore_manager/1055741 dies
TIMER [Build assembly graph / Distribute arcs] 25 minutes, 31.461914 seconds
TIMER [Build assembly graph] 38 minutes, 24.886475 seconds
[boisvert@miralac1 automated-tests]$ grep " ByteCount" spate-2014-11-15-04-07-25.output |tail -n1 thorium_node: node/1021 METRICS AliveActorCount: 62 ByteCount: 16198402048 / 17163091968
still too much memory fragmentation
Goal
Try with less memory fragmentation
[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-16-04-27-22 (032b24c9a77afc711c54718bc058f446dcb58f2f) 365409
job | after "TIMER [Build assembly graph / Distribute vertices]" | after "TIMER [Build assembly graph / Distribute arcs]" |
---|---|---|
spate-2014-11-15-04-07-25 | ByteCount: 11990700032 / 17163091968 | ByteCount: 16198402048 / 17163091968 |
spate-2014-11-16-04-27-22 | ByteCount: 12129284096 / 17163091968 | ByteCount: 13922926592 / 17163091968 |
DEBUG Error core_memory_allocate returned (nil), 8388608 bytes used / total -> 13805076480 / 17163091968 DEBUG_NODE Name= 413 WorkerCount= 15 MEMORY used / total -> 13805076480 / 17163091968
This is some nasty memory fragmentation. CNK is very bad at memory management.
Stack: [boisvert@cetuslac1 automated-tests]$ addr2line -e spate-2014-11-16-04-27-22.spate < spate-2014-11-16-04-27-22.stack /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/tracer.c:42 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory.c:135 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:342 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:278 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/structures/vector.c:181 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/helpers/vector_helper.c:351 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/patterns/manager.c:321 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/actor.c:1225 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1914 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1690 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:718 /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322 :0 ??:0
Memory usage:
Key | Name | Reported | Actors | Total |
---|---|---|---|---|
0x84a83916 | MEMORY_POOLNAME SEQUENCE_STORE | 30801920 | 15 | 462028800 |
0x46d316e4 | MEMORY_POOLNAME WORKER_OUTBOUND |
[boisvert@cetuslac1 automated-tests]$ size spate-2014-11-16-04-27-22.spate text data bss dec hex filename 8327049 1192404 1097452 10616905 a20049 spate-2014-11-16-04-27-22.spate
jemalloc test: #821
JobName spate-690-2048x16-1
Goal Do a complete assembly on Iowa Continuous Corn on Mira.
Machine Mira @ ALCF AllocationStatus Available= 837120.3
Path /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests
Commit feb224e93 stock, no patch
Toolchain XL
Script [boisvert@miralac1 automated-tests]$ cat spate-690-2048x16-1.sh
qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 2048 \ -t 03:00:00 \ -O spate-690-2048x16-1 \ --mode c1 \ spate-690-2048x16-1.spate -print-load -threads-per-node 16 \ -k 33 Iowa_Continuous_Corn/*.fastq \ -o spate-690-2048x16-1
Submission [boisvert@miralac1 automated-tests]$ ./spate-690-2048x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 369649
MachineUtilization ComputationLoad RunningTime [boisvert@cetuslac1 automated-tests]$ grep TIMER spate-690-2048x16-1.output TIMER [Load input / Count input data] 48.354530 seconds TIMER [Load input / Distribute input data] 2 minutes, 51.379822 seconds TIMER [Load input] 3 minutes, 39.734344 seconds TIMER [Build assembly graph / Distribute vertices] 6 minutes, 17.040649 seconds TIMER [Build assembly graph / Distribute arcs] 11 minutes, 45.084351 seconds TIMER [Build assembly graph] 18 minutes, 2.125000 seconds
MemoryUtilization Checksum GoodComments [boisvert@cetuslac1 automated-tests]$ grep visitors spate-690-2048x16-1.output DEBUG the system has 3072000 visitors
[boisvert@cetuslac1 automated-tests]$ grep GRAPH spate-690-2048x16-1.output GRAPH -> 148375705714 vertices, 298256036296 vertex observations, and 115990208309 arcs.
irb(main):001:0> 3072000_3500 => 10752000000 irb(main):003:0> 10752000000.0/148375705714_100 => 7.246469324786163
BadComments [boisvert@cetuslac1 automated-tests]$ tail spate-690-2048x16-1.cobaltlog -n1 Tue Nov 25 06:45:18 2014 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination
NeutralComments
The ratio is too low for 2048x16:
node count: 2048 store count: 204815 = 30720 visitor count per node: 15 \ 100 = 1500 ratio: 2048 / 1500 = 1.36
for 256x24:
ratio = 256 / (22 * 100) = 0.11
[boisvert@cetuslac1 automated-tests]$ grep "node/1543" spate-690-2048x16-1.output | grep LOAD |tail thorium_worker_pool: node/1543 EPOCH LOAD 9523 s 0.20/15 (0.01) 0.02 0.01 0.01 0.02 0.02 0.01 0.02 0.01 0.01 0.01 0.01 0.02 0.01 0.02 0.01 thorium_worker_pool: node/1543 EPOCH LOAD 9533 s 0.18/15 (0.01) 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.02 0.01 0.01 thorium_worker_pool: node/1543 EPOCH LOAD 9543 s 0.25/15 (0.02) 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.02 0.01 0.02 0.02 0.03 0.02 0.01 thorium_worker_pool: node/1543 EPOCH LOAD 9553 s 0.46/15 (0.03) 0.02 0.04 0.03 0.03 0.03 0.02 0.03 0.04 0.03 0.04 0.02 0.02 0.04 0.02 0.04 thorium_worker_pool: node/1543 EPOCH LOAD 9563 s 1.24/15 (0.08) 0.04 0.21 0.04 0.04 0.04 0.04 0.04 0.03 0.04 0.25 0.06 0.04 0.03 0.04 0.33 thorium_worker_pool: node/1543 EPOCH LOAD 9573 s 1.45/15 (0.10) 0.36 0.04 0.03 0.04 0.03 0.04 0.04 0.03 0.27 0.03 0.03 0.39 0.04 0.04 0.04 thorium_worker_pool: node/1543 EPOCH LOAD 9583 s 1.14/15 (0.08) 0.24 0.02 0.03 0.03 0.03 0.03 0.02 0.02 0.03 0.36 0.03 0.24 0.02 0.03 0.02 thorium_worker_pool: node/1543 EPOCH LOAD 9593 s 1.02/15 (0.07) 0.36 0.02 0.02 0.03 0.03 0.02 0.02 0.03 0.03 0.02 0.03 0.36 0.02 0.03 0.02 thorium_worker_pool: node/1543 EPOCH LOAD 9603 s 1.33/15 (0.09) 0.02 0.02 0.02 0.02 0.45 0.02 0.02 0.02 0.02 0.01 0.02 0.03 0.01 0.02 0.62 thorium_worker_pool: node/1543 EPOCH LOAD 9613 s 2.03/15 (0.14) 0.01 0.02 0.00 0.01 0.02 0.01 0.00 0.38 0.41 0.53 0.01 0.01 0.01 0.00 0.62
irb(main):002:0> (77721189 - 77684590) / 10.0 => 3659.9
better ratio: irb(main):015:0> 2048 / (15 * 100.0) => 1.3653333333333333 irb(main):016:0> 2048 / (15 * 400.0) => 0.3413333333333333
needs: #829
2014-09-20 I disabled the multiplexer since it seems to cause memory usage issues 2014-09-21 I fixed the huge leak in multiplexer. 2014-09-22 fixed leak of inbound buffers
Jobs:
soil-690-512x16-3soil-690-1024x16-3soil-690-2048x16-3soil-690-4096x16-3soil-690-512x16-4soil-690-1024x16-4soil-690-512x16-5 soil-690-1024x16-5 soil-690-2048x16-4 soil-690-4096x16-4