GeneAssembly / biosal

biosal is a distributed BIOlogical Sequence Actor Library. THIS IS A MIRROR.
BSD 2-Clause "Simplified" License
6 stars 1 forks source link

do a test with Spate on Mira on Iowa Continuous Corn 2048x16 #690

Open sebhtml opened 10 years ago

sebhtml commented 10 years ago

2014-09-20 I disabled the multiplexer since it seems to cause memory usage issues 2014-09-21 I fixed the huge leak in multiplexer. 2014-09-22 fixed leak of inbound buffers

Jobs:

soil-690-512x16-3 soil-690-1024x16-3 soil-690-2048x16-3 soil-690-4096x16-3 soil-690-512x16-4 soil-690-1024x16-4 soil-690-512x16-5 soil-690-1024x16-5 soil-690-2048x16-4 soil-690-4096x16-4

sebhtml commented 10 years ago

JobName soil-690-2048x16-1

Machine Mira

AllocationStatus CompBIO= 820040.0 core-hours

Path /projects/CompBIO/Projects/biosal-tests

Commit [boisvert@miralac1 biosal]$ git log | head -n1 commit 7ccbe7051ff2fc29215dfab7bcad878d44d4e0c3

Toolchain /soft/compilers/ibmcmp-may2014/vac/bg/12.1/bin/.orig/bgxlc_r:

Script [boisvert@miralac1 biosal-tests]$ cat soil-690-2048x16-1.sh

!/bin/bash

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 2048 \ -t 03:00:00 \ -O soil-690-2048x16-1 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-2048x16-1

Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-2048x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 332171

MachineUtilization not important

ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-2048x16-1.output TIMER [Load input / Count input data] 37.063904 seconds TIMER [Load input / Distribute input data] 2 minutes, 53.165146 seconds TIMER [Load input] 3 minutes, 30.229034 seconds TIMER [Build assembly graph / Distribute vertices] 7 minutes, 24.430878 seconds TIMER [Build assembly graph / Distribute arcs] 8 minutes, 38.779114 seconds TIMER [Build assembly graph] 16 minutes, 3.209961 seconds

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName soil-690-4096x16-1

Machine Mira

AllocationStatus CompBIO= 820040.0 core-hours

Path /projects/CompBIO/Projects/biosal-tests

Commit 7ccbe7051ff2fc29215dfab7bcad878d44d4e0c3

Toolchain /soft/compilers/ibmcmp-may2014/vac/bg/12.1/bin/.orig/bgxlc_r:

Script [boisvert@miralac1 biosal-tests]$ cat ./soil-690-4096x16-1.sh

!/bin/bash

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 4096 \ -t 03:00:00 \ -O soil-690-4096x16-1 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-4096x16-1

Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-4096x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 332172

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-4096x16-1.output TIMER [Load input / Count input data] 47.448002 seconds TIMER [Load input / Distribute input data] 4 minutes, 25.178558 seconds TIMER [Load input] 5 minutes, 12.626556 seconds TIMER [Build assembly graph / Distribute vertices] 7 minutes, 13.168182 seconds

MemoryUtilization Checksum GoodComments 4096 nodes 4096 MPI ranks 65536 cores = 4096 Thorium nodes + 61440 Thorium workers around 1000 actors per worker (around 60 M actors in total)

BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit dc5663feba4c3b87b057b0b490a997698ff4679e

Toolchain Script [boisvert@miralac1 biosal-tests]$ cat soil-690-1024x16-1.sh

!/bin/bash

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 03:00:00 \ -O soil-690-1024x16-1 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-1024x16-1

Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333531

MachineUtilization ComputationLoad RunningTime I fixed a bug. The multiplexer was leaking memory...

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit dc5663feba4c3b87b057b0b490a997698ff4679e updated->10149b6048e1dfc1e3b813066218bcec038059db

Toolchain Script [boisvert@miralac1 biosal-tests]$ cat soil-690-2048x16-2.sh

!/bin/bash

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 2048 \ -t 03:00:00 \ -O soil-690-2048x16-2 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-2048x16-2

Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-2048x16-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333533

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-2048x16-2.output TIMER [Load input / Count input data] 46.997513 seconds TIMER [Load input / Distribute input data] 2 minutes, 51.644424 seconds TIMER [Load input] 3 minutes, 38.641937 seconds TIMER [Build assembly graph / Distribute vertices] 7 minutes, 54.092896 seconds TIMER [Build assembly graph / Distribute arcs] 8 minutes, 26.360657 seconds TIMER [Build assembly graph] 16 minutes, 20.453552 seconds

MemoryUtilization pool systems are buggy:

[boisvert@miralac1 biosal-tests]$ grep POOL soil-690-2048x16-2.output DEBUG_POOL Name= 2 SinceSnapshotActiveSegments= 1351106 (9233967 - 7882861) ActiveSegments= 1351106 (9233967 - 7882861) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 46598495616 (46598495616 - 0) DEBUG_POOL Name= 3 SinceSnapshotActiveSegments= 0 (3151778 - 3151778) ActiveSegments= 0 (3151778 - 3151778) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 5307321344 (5307321344 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7593700 (57614537 - 65208237) SnapshotActiveSegments= -7593700 (57614537 - 65208237) AllocatedBytes= 3077991821 (3077991821 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 708 (136372 - 135664) ActiveSegments= 708 (136372 - 135664) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 624469120 (624469120 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -8353736 (53753412 - 62107148) SnapshotActiveSegments= -8353736 (53753412 - 62107148) AllocatedBytes= 3029552645 (3029552645 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (135319 - 135318) ActiveSegments= 1 (135319 - 135318) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 588121600 (588121600 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7209466 (66284628 - 73494094) SnapshotActiveSegments= -7209466 (66284628 - 73494094) AllocatedBytes= 3683357471 (3683357471 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (135444 - 135443) ActiveSegments= 1 (135444 - 135443) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 721218560 (721218560 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -8820644 (65426505 - 74247149) SnapshotActiveSegments= -8820644 (65426505 - 74247149) AllocatedBytes= 3500035355 (3500035355 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (159915 - 159914) ActiveSegments= 1 (159915 - 159914) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 734842112 (734842112 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -8816274 (69578505 - 78394779) SnapshotActiveSegments= -8816274 (69578505 - 78394779) AllocatedBytes= 3813768552 (3813768552 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (159920 - 159919) ActiveSegments= 1 (159920 - 159919) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 767920896 (767920896 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7504973 (56204768 - 63709741) SnapshotActiveSegments= -7504973 (56204768 - 63709741) AllocatedBytes= 3024674231 (3024674231 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (159911 - 159910) ActiveSegments= 1 (159911 - 159910) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 604280576 (604280576 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -9953168 (111976479 - 121929647) SnapshotActiveSegments= -9953168 (111976479 - 121929647) AllocatedBytes= 4151413769 (4151413769 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (525350 - 525349) ActiveSegments= 1 (525350 - 525349) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 713435264 (713435264 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7213108 (104258815 - 111471923) SnapshotActiveSegments= -7213108 (104258815 - 111471923) AllocatedBytes= 3623808184 (3623808184 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (500768 - 500767) ActiveSegments= 1 (500768 - 500767) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 616991360 (616991360 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -6072074 (98500885 - 104572959) SnapshotActiveSegments= -6072074 (98500885 - 104572959) AllocatedBytes= 3544720318 (3544720318 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (463898 - 463897) ActiveSegments= 1 (463898 - 463897) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 542231168 (542231168 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7158160 (104660737 - 111818897) SnapshotActiveSegments= -7158160 (104660737 - 111818897) AllocatedBytes= 3613930975 (3613930975 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 2029 (2295392 - 2293363) ActiveSegments= 2029 (2295392 - 2293363) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 834120448 (834120448 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -6086028 (104815891 - 110901919) SnapshotActiveSegments= -6086028 (104815891 - 110901919) AllocatedBytes= 3753892125 (3753892125 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (476190 - 476189) ActiveSegments= 1 (476190 - 476189) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 614874624 (614874624 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7685116 (112027476 - 119712592) SnapshotActiveSegments= -7685116 (112027476 - 119712592) AllocatedBytes= 4088439708 (4088439708 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (500758 - 500757) ActiveSegments= 1 (500758 - 500757) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 694589440 (694589440 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7879013 (55920040 - 63799053) SnapshotActiveSegments= -7879013 (55920040 - 63799053) AllocatedBytes= 3120891696 (3120891696 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (147619 - 147618) ActiveSegments= 1 (147619 - 147618) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 636470912 (636470912 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -8341234 (62276421 - 70617655) SnapshotActiveSegments= -8341234 (62276421 - 70617655) AllocatedBytes= 3383011803 (3383011803 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (147617 - 147616) ActiveSegments= 1 (147617 - 147616) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 679147264 (679147264 - 0) DEBUG_POOL Name= 0 SinceSnapshotActiveSegments= 0 (0 - 0) ActiveSegments= -7403626 (61988581 - 69392207) SnapshotActiveSegments= -7403626 (61988581 - 69392207) AllocatedBytes= 3193199716 (3193199716 - 0) DEBUG_POOL Name= 1 SinceSnapshotActiveSegments= 1 (159896 - 159895) ActiveSegments= 1 (159896 - 159895) SnapshotActiveSegments= 0 (0 - 0) AllocatedBytes= 667083904 (667083904 - 0)

Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit dc5663f updated->10149b6048e1dfc1e3b813066218bcec038059db

Toolchain Script [boisvert@miralac1 biosal-tests]$ cat soil-690-4096x16-2.sh

!/bin/bash

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 4096 \ -t 03:00:00 \ -O soil-690-4096x16-2 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-4096x16-2

Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-4096x16-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333535

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-4096x16-2.output TIMER [Load input / Count input data] 51.708893 seconds TIMER [Load input / Distribute input data] 4 minutes, 28.937408 seconds TIMER [Load input] 5 minutes, 20.646301 seconds TIMER [Build assembly graph / Distribute vertices] 10 minutes, 25.557434 seconds TIMER [Build assembly graph / Distribute arcs] 20 minutes, 57.344849 seconds TIMER [Build assembly graph] 31 minutes, 22.902344 seconds

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit dc5663f

Toolchain Script [boisvert@miralac1 biosal-tests]$ cat soil-690-512x16-1.sh

!/bin/bash

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 512 \ -t 03:00:00 \ -O soil-690-512x16-1 \ --mode c1 \ spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o soil-690-512x16-1

Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333537

MachineUtilization ComputationLoad RunningTime out of memory, I now solved the issue.

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 10149b6048e1dfc1e3b813066218bcec038059db

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333623

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-1024x16-2.output TIMER [Load input / Count input data] 52.247120 seconds TIMER [Load input / Distribute input data] 1 minutes, 47.248177 seconds TIMER [Load input] 2 minutes, 39.495300 seconds TIMER [Build assembly graph / Distribute vertices] 10 minutes, 10.982605 seconds TIMER [Build assembly graph / Distribute arcs] 12 minutes, 52.842712 seconds TIMER [Build assembly graph] 23 minutes, 3.825317 seconds

MemoryUtilization Checksum [boisvert@miralac1 biosal-tests]$ sha1sum soil-690-1024x16-2/coverage_distribution.txt-canonical 01a293db48518190038eaddbaed8a47ca0323fc7 soil-690-1024x16-2/coverage_distribution.txt-canonical

GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 10149b6048e1dfc1e3b813066218bcec038059db

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-2.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 333622

MachineUtilization ComputationLoad RunningTime MemoryUtilization [boisvert@miralac1 biosal-tests]$ grep Error soil-690-512x16-2.output DEBUG Error bsal_memory_allocate returned (nil), 1040187392 bytes

Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

mira is in maintenance.

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 96a356661370cbdd14961fa5f8b16f8f6949d18b

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 334227

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-512x16-3.output TIMER [Load input / Count input data] 1 minutes, 5.614647 seconds TIMER [Load input / Distribute input data] 1 minutes, 32.000870 seconds TIMER [Load input] 2 minutes, 37.615524 seconds

MemoryUtilization [boisvert@miralac1 biosal-tests]$ grep Error soil-690-512x16-3.output DEBUG Error bsal_memory_allocate returned (nil), 1040187392 bytes ActiveSegments= 127326

Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 96a356661370cbdd14961fa5f8b16f8f6949d18b

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 334228

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-1024x16-3.output TIMER [Load input / Count input data] 43.734001 seconds TIMER [Load input / Distribute input data] 1 minutes, 57.532341 seconds TIMER [Load input] 2 minutes, 41.266342 seconds TIMER [Build assembly graph / Distribute vertices] 10 minutes, 4.200684 seconds TIMER [Build assembly graph / Distribute arcs] 12 minutes, 47.443115 seconds TIMER [Build assembly graph] 22 minutes, 51.643799 seconds

MemoryUtilization [boisvert@miralac1 biosal-tests]$ grep Error soil-690-1024x16-3.output DEBUG Error bsal_memory_allocate returned (nil), 8192 bytes used / total -> 17090740224 / 17163091968

Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 96a356661370cbdd14961fa5f8b16f8f6949d18b 84bb626d22252a5d4603647e0255d1c9876f97b7 5e0b35619e7b3f663e5663896d9e761004135954

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-2048x16-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 334229

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-2048x16-3.output TIMER [Load input / Count input data] 37.634472 seconds TIMER [Load input / Distribute input data] 2 minutes, 53.514832 seconds TIMER [Load input] 3 minutes, 31.149307 seconds TIMER [Build assembly graph / Distribute vertices] 6 minutes, 47.528839 seconds

MemoryUtilization Checksum GoodComments BadComments out of memory

NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 96a356661370cbdd14961fa5f8b16f8f6949d18b 84bb626d22252a5d4603647e0255d1c9876f97b7 5e0b35619e7b3f663e5663896d9e761004135954

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-4096x16-3.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 334231

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-4096x16-3.output TIMER [Load input / Count input data] 1 minutes, 3.382736 seconds TIMER [Load input / Distribute input data] 4 minutes, 24.529755 seconds TIMER [Load input] 5 minutes, 27.912506 seconds TIMER [Build assembly graph / Distribute vertices] 97 minutes, 42.688477 seconds

MemoryUtilization Checksum GoodComments BadComments out of memory

NeutralComments

sebhtml commented 10 years ago

[boisvert@miralac1 biosal-tests]$ qstat -f -u boisvert JobID JobName User Score WallTime QueuedTime RunTime Nodes State Location Mode Procs Queue StartTime

334231 soil-690-4096x16-3 boisvert 916.3 04:00:00 103:15:14 N/A 4096 queued None c1 4096 prod-short N/A
334229 soil-690-2048x16-3 boisvert 1.1 K 03:00:00 103:15:59 N/A 2048 queued None c1 2048 prod-short N/A

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 84bb626d22252a5d4603647e0255d1c9876f97b7 5e0b35619e7b3f663e5663896d9e761004135954

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-4.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 336506

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-512x16-4.output TIMER [Load input / Count input data] 1 minutes, 2.588917 seconds TIMER [Load input / Distribute input data] 1 minutes, 38.910957 seconds TIMER [Load input] 2 minutes, 41.499878 seconds

MemoryUtilization Checksum GoodComments BadComments out of memory

NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 84bb626d22252a5d4603647e0255d1c9876f97b7 5e0b35619e7b3f663e5663896d9e761004135954

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-4.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 336507

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 biosal-tests]$ grep TIMER soil-690-1024x16-4.output TIMER [Load input / Count input data] 38.235104 seconds TIMER [Load input / Distribute input data] 1 minutes, 46.555290 seconds TIMER [Load input] 2 minutes, 24.790390 seconds

MemoryUtilization out of memory

Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 04fe9b871d9fc480a90696f1bf12073c18a6abf3 9a2fd78d2ffcc350bdc4e1b653208c35f44989ac

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-512x16-5.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 337447

MachineUtilization ComputationLoad RunningTime cancelled

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 04fe9b871d9fc480a90696f1bf12073c18a6abf3 9a2fd78d2ffcc350bdc4e1b653208c35f44989ac

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-1024x16-5.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 337448

MachineUtilization ComputationLoad RunningTime SIGSEGV because of memory usage

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 04fe9b871d9fc480a90696f1bf12073c18a6abf3 9a2fd78d2ffcc350bdc4e1b653208c35f44989ac

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-2048x16-4.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 337449

MachineUtilization ComputationLoad RunningTime cancelled

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

JobName Machine AllocationStatus Path Commit 04fe9b871d9fc480a90696f1bf12073c18a6abf3 9a2fd78d2ffcc350bdc4e1b653208c35f44989ac

Toolchain Script Submission [boisvert@miralac1 biosal-tests]$ ./soil-690-4096x16-4.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 337450

MachineUtilization ComputationLoad RunningTime cancelled

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 10 years ago

will wait for new allocation to kick in

sebhtml commented 10 years ago

dependencies #709 #723

JobName Goal Evaluate Spate on Blue Gene/Q since Beagle will be down until December 2014

Machine AllocationStatus Path Commit 4908fd9090bd79d9373021eb9b746c65bfb414bc

Toolchain Script Submission Submitted build spate-2014-11-10-18-34-23 (4908fd9090bd79d9373021eb9b746c65bfb414bc) 361599

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-10-18-34-23.output TIMER [Load input / Count input data] 1 minutes, 1.136402 seconds TIMER [Load input / Distribute input data] 2 minutes, 44.771774 seconds TIMER [Load input] 3 minutes, 45.908173 seconds TIMER [Build assembly graph / Distribute vertices] core_manager/1055741 dies [boisvert@miralac1 automated-tests]$ grep minutes spate-2014-11-10-18-34-23.output TIMER [Load input / Count input data] 1 minutes, 1.136402 seconds TIMER [Load input / Distribute input data] 2 minutes, 44.771774 seconds TIMER [Load input] 3 minutes, 45.908173 seconds 28 minutes, 48.453003 seconds

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 9 years ago

Last good:

spate-2014-10-27-17-08-34 (b3592bc) https://github.com/GeneAssembly/biosal/issues/775#issuecomment-60637238 the multiplexer was disabled.

sebhtml commented 9 years ago

JobName Goal Verify if the slow-down is caused by the multiplexer. If it is the case, ticket #805 should fix it. In this run, assertions were turned on

Machine AllocationStatus Path Commit ed8f3d2e4d82df226dd2a3dc1d998135acceb1dc

Toolchain Script [boisvert@miralac1 automated-tests]$ cat spate-2014-11-11-20-05-40.sh

!/bin/bash

echo "Commit= ed8f3d2e4d82df226dd2a3dc1d998135acceb1dc"

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-11-20-05-40 \ --mode c1 \ spate-2014-11-11-20-05-40.spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-11-20-05-40 -disable-multiplexer

Submission [boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh

(added -disable-multiplexer)

[boisvert@miralac1 automated-tests]$ ./spate-2014-11-11-20-05-40.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 362328

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-11-20-05-40.output TIMER [Load input / Count input data] 40.701588 seconds TIMER [Load input / Distribute input data] 1 minutes, 43.710915 seconds TIMER [Load input] 2 minutes, 24.412506 seconds TIMER [Build assembly graph / Distribute vertices] 22 minutes, 21.978760 seconds TIMER [Build assembly graph / Distribute arcs] 28 minutes, 13.162842 seconds TIMER [Build assembly graph] 50 minutes, 35.141602 seconds

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 9 years ago

Goal Test without assertions

test on Mira with new build script (CONFIG_DEBUG=n) since DEBUG=y generates much slower code.

[boisvert@miralac1 biosal]$ tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh

Submitted build spate-2014-11-12-04-00-21 (211d4e1893bc70617764ba534068e3d2f196dbdd) 362582

[boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-12-04-00-21.output TIMER [Load input / Count input data] 42.059803 seconds TIMER [Load input / Distribute input data] 4 minutes, 3.187134 seconds TIMER [Load input] 4 minutes, 45.246918 seconds TIMER [Build assembly graph / Distribute vertices] 14 minutes, 26.619629 seconds TIMER [Build assembly graph / Distribute arcs] 28 minutes, 12.110107 seconds TIMER [Build assembly graph] 42 minutes, 38.729736 seconds

sebhtml commented 9 years ago

JobName Goal Test without multiplexer and without assertions

Machine Mira

AllocationStatus Path Commit 211d4e1893bc70617764ba534068e3d2f196dbdd

Toolchain Script [boisvert@miralac1 automated-tests]$ cat spate-2014-11-12-15-48-15.sh

!/bin/bash

echo "Commit= 211d4e1893bc70617764ba534068e3d2f196dbdd"

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 1024 \ -t 01:00:00 \ -O spate-2014-11-12-15-48-15 \ --mode c1 \ spate-2014-11-12-15-48-15.spate -print-load -threads-per-node 16 \ -k 43 Iowa_Continuous_Corn/*.fastq \ -o spate-2014-11-12-15-48-15 -disable-multiplexer

Submission [boisvert@miralac1 automated-tests]$ ./spate-2014-11-12-15-48-15.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 362815

MachineUtilization ComputationLoad RunningTime [boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-12-15-48-15.output TIMER [Load input / Count input data] 40.135349 seconds TIMER [Load input / Distribute input data] 1 minutes, 43.414711 seconds TIMER [Load input] 2 minutes, 23.550064 seconds TIMER [Build assembly graph / Distribute vertices] 11 minutes, 20.589844 seconds TIMER [Build assembly graph / Distribute arcs] 22 minutes, 30.075806 seconds TIMER [Build assembly graph] 33 minutes, 50.665771 seconds

MemoryUtilization thorium_node: node/538 METRICS AliveActorCount: 1476 ByteCount: 14148833280 / 17163091968

Checksum [boisvert@miralac1 automated-tests]$ sha1sum spate-2014-11-12-15-48-15/coverage_distribution.txt-canonical 01a293db48518190038eaddbaed8a47ca0323fc7 spate-2014-11-12-15-48-15/coverage_distribution.txt-canonical

GoodComments -disable-multiplexer won't be required because the multiplexer is now adaptive.

BadComments It ran out of memory because of all these small messages.

NeutralComments

sebhtml commented 9 years ago

JobName Goal Test adaptive multiplexer algorithm and lower k-mer length

Machine AllocationStatus Path Commit 509472f94bdbb14b13346ee5d3a9573763a2f5f3

Toolchain Script Submission [boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-13-22-09-23 (509472f94bdbb14b13346ee5d3a9573763a2f5f3) 364013

MachineUtilization ComputationLoad RunningTime cancelled, problem with timeout

MemoryUtilization Checksum GoodComments BadComments NeutralComments

sebhtml commented 9 years ago

JobName Goal Test manual activation of multiplexer

Machine AllocationStatus Path Commit Toolchain Script Submission

[boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh \ Project 'compbio'; job rerouted to queue 'prod-short' Submitted build spate-2014-11-14-02-46-11 (b0ecb745f90d7261a116a6175c7e57beb4b4b680) 364200

MachineUtilization ComputationLoad RunningTime

[boisvert@cetuslac1 automated-tests]$ grep TIMER spate-2014-11-14-02-46-11.output
TIMER [Load input / Count input data] 39.870667 seconds TIMER [Load input / Distribute input data] 1 minutes, 42.208771 seconds TIMER [Load input] 2 minutes, 22.079437 seconds TIMER [Build assembly graph / Distribute vertices] 13 minutes, 29.460754 seconds TIMER [Build assembly graph / Distribute arcs] 26 minutes, 22.444458 seconds TIMER [Build assembly graph] 39 minutes, 51.905273 seconds

MemoryUtilization Ran out of memory here:

DEBUG Error core_memory_allocate returned (nil), 8388608 bytes used / total -> 15699095552 / 17163091968

[boisvert@cetuslac1 automated-tests]$ addr2line -e spate-2014-11-14-02-46-11.spate < spate-2014-11-14-02-46-11.stack /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/tracer.c:42 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory.c:128 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:148 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/structures/vector.c:181 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/helpers/vector_helper.c:351 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/patterns/manager.c:314 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/actor.c:1207 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1909 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1685 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:713 /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322 :0

[boisvert@cetuslac1 automated-tests]$ grep GRAPH spate-2014-11-14-02-46-11.output GRAPH -> 148043434448 vertices, 324996128800 vertex observations, and 131189785122 arcs.

This is 217 bytes per k-mer...

irb(main):002:0> 15699095552 / (148043434448 / 2 / 1024) => 217

biosal_assembly_sliding_window/1051763 kmer length is 27

So there is a lot of memory fragmentation.

after loading input data: thorium_node: node/385 METRICS AliveActorCount: 61 ByteCount: 1937833984 / 17163091968

after vertices extraction: thorium_node: node/60 METRICS AliveActorCount: 61 ByteCount: 13802754048 / 17163091968

thorium_node: node/958 METRICS AliveActorCount: 61 ByteCount: 15634251776 / 17163091968

http://www.idris.fr/eng/turing/turing-fragmentation_memoire-eng.html

Checksum GoodComments BadComments NeutralComments

sebhtml commented 9 years ago

needs: #819 #820

sebhtml commented 9 years ago

Goal Check if there is less fragmentation now.

[boisvert@miralac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-14-23-39-40 (558c83b2ccf26061031a66a9dafeb41492712931) 364888

sebhtml commented 9 years ago

Goal Test with normalized allocation sizes.

[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-15-04-07-25 (c4d7a9d0e8f66104b750fbac825532425fa4e39a) 364993

[boisvert@miralac1 automated-tests]$ grep TIMER spate-2014-11-15-04-07-25.output
TIMER [Load input / Count input data] 41.663147 seconds TIMER [Load input / Distribute input data] 1 minutes, 42.034645 seconds TIMER [Load input] 2 minutes, 23.697800 seconds TIMER [Build assembly graph / Distribute vertices] 12 minutes, 53.424500 secondscore_manager/1055741 dies TIMER [Build assembly graph / Distribute arcs] 25 minutes, 31.461914 seconds TIMER [Build assembly graph] 38 minutes, 24.886475 seconds

[boisvert@miralac1 automated-tests]$ grep " ByteCount" spate-2014-11-15-04-07-25.output |tail -n1 thorium_node: node/1021 METRICS AliveActorCount: 62 ByteCount: 16198402048 / 17163091968

still too much memory fragmentation

sebhtml commented 9 years ago

Goal

Try with less memory fragmentation

[boisvert@cetuslac1 biosal]$ ./tests/Cetus_IBM_Blue_Gene_Q/launch-Spate-Iowa-Contiunous-Corn.sh Submitted build spate-2014-11-16-04-27-22 (032b24c9a77afc711c54718bc058f446dcb58f2f) 365409

job after "TIMER [Build assembly graph / Distribute vertices]" after "TIMER [Build assembly graph / Distribute arcs]"
spate-2014-11-15-04-07-25 ByteCount: 11990700032 / 17163091968 ByteCount: 16198402048 / 17163091968
spate-2014-11-16-04-27-22 ByteCount: 12129284096 / 17163091968 ByteCount: 13922926592 / 17163091968

DEBUG Error core_memory_allocate returned (nil), 8388608 bytes used / total -> 13805076480 / 17163091968 DEBUG_NODE Name= 413 WorkerCount= 15 MEMORY used / total -> 13805076480 / 17163091968

This is some nasty memory fragmentation. CNK is very bad at memory management.

Stack: [boisvert@cetuslac1 automated-tests]$ addr2line -e spate-2014-11-16-04-27-22.spate < spate-2014-11-16-04-27-22.stack /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/tracer.c:42 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory.c:135 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:342 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/system/memory_pool.c:278 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/structures/vector.c:181 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/helpers/vector_helper.c:351 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/core/patterns/manager.c:321 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/actor.c:1225 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1914 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:1690 /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests/biosal/engine/thorium/worker.c:718 /bgsys/drivers/V1R2M2/ppc64/toolchain/gnu/glibc-2.12.2/nptl/pthread_create.c:322 :0 ??:0

Memory usage:

Key Name Reported Actors Total
0x84a83916 MEMORY_POOLNAME SEQUENCE_STORE 30801920 15 462028800
0x46d316e4 MEMORY_POOLNAME WORKER_OUTBOUND

[boisvert@cetuslac1 automated-tests]$ size spate-2014-11-16-04-27-22.spate text data bss dec hex filename 8327049 1192404 1097452 10616905 a20049 spate-2014-11-16-04-27-22.spate

sebhtml commented 9 years ago

jemalloc test: #821

sebhtml commented 9 years ago

JobName spate-690-2048x16-1

Goal Do a complete assembly on Iowa Continuous Corn on Mira.

Machine Mira @ ALCF AllocationStatus Available= 837120.3

Path /gpfs/mira-fs1/projects/CompBIO/Projects/automated-tests

Commit feb224e93 stock, no patch

Toolchain XL

Script [boisvert@miralac1 automated-tests]$ cat spate-690-2048x16-1.sh

!/bin/bash

feb224e93

qsub \ --env PAMID_THREAD_MULTIPLE=1 \ -A CompBIO \ -n 2048 \ -t 03:00:00 \ -O spate-690-2048x16-1 \ --mode c1 \ spate-690-2048x16-1.spate -print-load -threads-per-node 16 \ -k 33 Iowa_Continuous_Corn/*.fastq \ -o spate-690-2048x16-1

Submission [boisvert@miralac1 automated-tests]$ ./spate-690-2048x16-1.sh \ Project 'compbio'; job rerouted to queue 'prod-short' 369649

MachineUtilization ComputationLoad RunningTime [boisvert@cetuslac1 automated-tests]$ grep TIMER spate-690-2048x16-1.output TIMER [Load input / Count input data] 48.354530 seconds TIMER [Load input / Distribute input data] 2 minutes, 51.379822 seconds TIMER [Load input] 3 minutes, 39.734344 seconds TIMER [Build assembly graph / Distribute vertices] 6 minutes, 17.040649 seconds TIMER [Build assembly graph / Distribute arcs] 11 minutes, 45.084351 seconds TIMER [Build assembly graph] 18 minutes, 2.125000 seconds

MemoryUtilization Checksum GoodComments [boisvert@cetuslac1 automated-tests]$ grep visitors spate-690-2048x16-1.output DEBUG the system has 3072000 visitors

[boisvert@cetuslac1 automated-tests]$ grep GRAPH spate-690-2048x16-1.output GRAPH -> 148375705714 vertices, 298256036296 vertex observations, and 115990208309 arcs.

irb(main):001:0> 3072000_3500 => 10752000000 irb(main):003:0> 10752000000.0/148375705714_100 => 7.246469324786163

BadComments [boisvert@cetuslac1 automated-tests]$ tail spate-690-2048x16-1.cobaltlog -n1 Tue Nov 25 06:45:18 2014 +0000 (UTC) Info: maximum execution time exceeded; initiating job termination

NeutralComments

sebhtml commented 9 years ago

The ratio is too low for 2048x16:

node count: 2048 store count: 204815 = 30720 visitor count per node: 15 \ 100 = 1500 ratio: 2048 / 1500 = 1.36

for 256x24:

ratio = 256 / (22 * 100) = 0.11

load

[boisvert@cetuslac1 automated-tests]$ grep "node/1543" spate-690-2048x16-1.output | grep LOAD |tail thorium_worker_pool: node/1543 EPOCH LOAD 9523 s 0.20/15 (0.01) 0.02 0.01 0.01 0.02 0.02 0.01 0.02 0.01 0.01 0.01 0.01 0.02 0.01 0.02 0.01 thorium_worker_pool: node/1543 EPOCH LOAD 9533 s 0.18/15 (0.01) 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.02 0.01 0.01 thorium_worker_pool: node/1543 EPOCH LOAD 9543 s 0.25/15 (0.02) 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.02 0.01 0.02 0.02 0.03 0.02 0.01 thorium_worker_pool: node/1543 EPOCH LOAD 9553 s 0.46/15 (0.03) 0.02 0.04 0.03 0.03 0.03 0.02 0.03 0.04 0.03 0.04 0.02 0.02 0.04 0.02 0.04 thorium_worker_pool: node/1543 EPOCH LOAD 9563 s 1.24/15 (0.08) 0.04 0.21 0.04 0.04 0.04 0.04 0.04 0.03 0.04 0.25 0.06 0.04 0.03 0.04 0.33 thorium_worker_pool: node/1543 EPOCH LOAD 9573 s 1.45/15 (0.10) 0.36 0.04 0.03 0.04 0.03 0.04 0.04 0.03 0.27 0.03 0.03 0.39 0.04 0.04 0.04 thorium_worker_pool: node/1543 EPOCH LOAD 9583 s 1.14/15 (0.08) 0.24 0.02 0.03 0.03 0.03 0.03 0.02 0.02 0.03 0.36 0.03 0.24 0.02 0.03 0.02 thorium_worker_pool: node/1543 EPOCH LOAD 9593 s 1.02/15 (0.07) 0.36 0.02 0.02 0.03 0.03 0.02 0.02 0.03 0.03 0.02 0.03 0.36 0.02 0.03 0.02 thorium_worker_pool: node/1543 EPOCH LOAD 9603 s 1.33/15 (0.09) 0.02 0.02 0.02 0.02 0.45 0.02 0.02 0.02 0.02 0.01 0.02 0.03 0.01 0.02 0.62 thorium_worker_pool: node/1543 EPOCH LOAD 9613 s 2.03/15 (0.14) 0.01 0.02 0.00 0.01 0.02 0.01 0.00 0.38 0.41 0.53 0.01 0.01 0.01 0.00 0.62

messaging rate per node

irb(main):002:0> (77721189 - 77684590) / 10.0 => 3659.9

sebhtml commented 9 years ago

better ratio: irb(main):015:0> 2048 / (15 * 100.0) => 1.3653333333333333 irb(main):016:0> 2048 / (15 * 400.0) => 0.3413333333333333

sebhtml commented 9 years ago

needs: #829