cbirdlab / dDocentHPC

hard fork of dDocent, edited to run without interactive user input

Make Freebayes Run Multinode and Reduce RAM Hogging #7

Open cbird808 opened 4 years ago

cbird808 commented 4 years ago

Hi Guys,

Before I go to sleep tonight, I’d like to post an update, and it’s pretty good news. I can confirm that we can run 120 freebayes processes (117 actually; I think that is the maximum number of divisions that can be made prior to the freebayes stage).

From my understanding of dDocentHPC, I think a multi-node freebayes run is relatively easy to achieve. On the Turing cluster, it is as simple as adding srun before the freebayes command in the parallel arguments.

However, I recently started deploying software through containers, and this becomes somewhat problematic, because all the software (dDocentHPC.bash) runs inside the container while the scheduler (dDocentHPC.sbatch) runs outside. My initial idea was to change dDocentHPC.bash so that it creates an “execution plan” starting from the freebayes stage. As far as I can tell, no condition after that point depends on previous code, and everything is determined by then. I could then run this plan from the sbatch job script, since the inside and the outside share storage. I got this somewhat working, but I want to make sure Chris can actually use it, and my modification basically only worked in the ODU HPC environment, so it is rather unportable.

So I changed direction and focused on getting job submission to work from inside the container. It took some work, but I am glad to say it is working now. The benefit is that this is all done during container building, so you do not need to worry about any of it.

Now the only change that is required is here, previously:

parallel -j $NUMProc "freebayes -b cat.$ ……

now:

parallel -j $NUMProc --delay 5 "srun -n 1 -r \$(expr \$PARALLEL_SEQ % $SLURM_NNODES) crun env LD_PRELOAD=/opt/conda/lib/libjemalloc.so freebayes -b cat.$ ….

This is a slightly crazy-looking line, but it lets us run freebayes across nodes. Let me explain:

srun: the Slurm step launcher; SGE/Torque/PBS may need to use a job array here

-n 1: 1 task

-r \$(expr \$PARALLEL_SEQ % $SLURM_NNODES): this is how to make Slurm spread the job steps evenly among nodes; it’s quite weird that it does not spread them automatically

crun: this is actually “singularity run container_image.sif”; I made crun to save typing

env LD_PRELOAD=/opt/conda/lib/libjemalloc.so: we have to move the jemalloc setting down to here. It can no longer apply globally, because Slurm will carry it out of the container and try to apply it outside of crun, which causes a lot of failures. Setting it here applies it only to the freebayes command.
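To see what the -r expression does, here is a minimal bash sketch of the same arithmetic. It assumes that GNU parallel numbers jobs from 1 in PARALLEL_SEQ and that -r takes a zero-based node index within the allocation. The function names relative_node and jobs_per_node are made up for illustration and are not part of dDocentHPC.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors `expr $PARALLEL_SEQ % $SLURM_NNODES` from the srun line.
# relative_node and jobs_per_node are hypothetical helper names.

# Relative node index (0-based) for the job with sequence number $1 on $2 nodes.
relative_node() {
    local seq=$1 nnodes=$2
    echo $(( seq % nnodes ))
}

# Count how many of $1 jobs land on each of $2 nodes.
jobs_per_node() {
    local njobs=$1 nnodes=$2 seq idx
    local -a counts=()
    for (( idx = 0; idx < nnodes; idx++ )); do counts[idx]=0; done
    for (( seq = 1; seq <= njobs; seq++ )); do
        idx=$(relative_node "$seq" "$nnodes")
        counts[idx]=$(( counts[idx] + 1 ))
    done
    echo "${counts[*]}"
}

# 117 freebayes jobs on 3 nodes
jobs_per_node 117 3
```

With 117 jobs and 3 nodes this prints 39 39 39, so the modulo spreads the steps evenly as long as the job count is a multiple of the node count; otherwise the lower-numbered indices (starting from 1, since the sequence starts at 1) get one extra job each.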

For the job script, we can use:

#SBATCH -n 120

#SBATCH --ntasks-per-node 40

If everything is alright, then we can see job steps like this:

   JobID        JobName     State    AllocCPUS  NodeList
   40098        mkVCF_Ssp+  RUNNING  120        d4-w6420b-[01,+
   40098.batch  batch       RUNNING  40         d4-w6420b-01
   40098.0      crun        RUNNING  1          d4-w6420b-03
   40098.1      crun        RUNNING  1          d4-w6420b-04
   40098.2      crun        RUNNING  1          d4-w6420b-01
   ….
   40098.107    crun        RUNNING  1          d4-w6420b-01
   40098.108    crun        RUNNING  1          d4-w6420b-03
   40098.109    crun        RUNNING  1          d4-w6420b-04
   40098.110    crun        RUNNING  1          d4-w6420b-01
   40098.111    crun        RUNNING  1          d4-w6420b-03
   40098.112    crun        RUNNING  1          d4-w6420b-04
   40098.113    crun        RUNNING  1          d4-w6420b-01
   40098.114    crun        RUNNING  1          d4-w6420b-03
   40098.115    crun        RUNNING  1          d4-w6420b-04
   40098.116    crun        RUNNING  1          d4-w6420b-01
   40098.117    crun        RUNNING  1          d4-w6420b-03

I can confirm that each of the nodes is running 39 freebayes processes (again, it seems to split into 117, and that is the max?).

With this configuration, we can also do something more interesting. For instance, I can ask the scheduler for 120 tasks that are not on the same 3 nodes and spread them among multiple machines. Or I can ask for three and a half nodes but still have 120 threads in config.4.all.cbirdq. The reason to do that comes from my last 40-task run:

As you can see, there is a big spike there. That is actually pretty bad; the system issued an OOM kill there.

With more choices for spreading the job, I think we are done with the memory issue once and for all.

I will come back to check on the test run status tomorrow morning. I am pretty optimistic about it 🤞 I will update you once I have something. You can also check with the command below, if you are interested:

sacct -j 40098 -o jobid,jobname,state,alloccpus,nodelist

Best, Min Dong

From: Bird, Chris Chris.Bird@tamucc.edu Sent: Sunday, May 3, 2020 7:26 PM To: Dong, Min mdong@odu.edu; Carpenter, Kent E. kcarpent@odu.edu; Garcia, Eric e1garcia@odu.edu Cc: hpc hpc@odu.edu Subject: Re: salloc problem on 38164: COMPLETED

Hi Min,

This is great! Thank you!

I maintain dDocentHPC and faster is better.

Ideally, the user would be able to specify the number of nodes to use in the config file and then they can be utilized by freebayes. If it works well, I'll accept the pull req and incorporate it into the master.

Fwiw, the other part of dDocentHPC mkVCF that takes a lot of time is creating the cat*bam file. I have it so that once that file is created, it will be used rather than overwritten if mkVCF is run again.

Cheers,

Chris



From: Dong, Min mdong@odu.edu Sent: Sunday, May 3, 2020 12:52:53 AM To: Kent Carpenter kcarpent@odu.edu; Bird, Chris Chris.Bird@tamucc.edu; Garcia, Eric e1garcia@odu.edu Cc: hpc hpc@odu.edu Subject: RE: salloc problem on 38164: COMPLETED

Hi Guys,

I have good news: it took quite some time, but I figured out a way to deal with the freebayes memory issue.

After researching the freebayes site and code, I could not find any way to tune it directly. The couple of ways suggested on the GitHub page require some compromise in the actual calculation (or at least that is my guess; I honestly don’t understand what the author means, due to my lack of domain knowledge).

But while profiling freebayes, I found that its memory seems to be VERY fragmented. If I could condense the memory somehow, I might be able to run more freebayes processes on each machine. After all, freebayes seems to be the most time-consuming part of the job, so more freebayes processes means a faster job. After some trial and error, I found that using jemalloc to replace the default malloc call achieves this.

“malloc” is a C library call that is responsible for requesting memory from the OS and giving it to the application. This process is not done entirely inside the Linux kernel: the kernel provides the “sbrk” system call, and the C library provides an algorithm through “malloc” to make the best use of memory. Like the majority of the Linux world, my cluster uses malloc from the GNU C Library (glibc). This is the standard malloc; it generally performs well and is stable. Let’s just say there is a reason everybody uses it.

“jemalloc” provides the same malloc call but with a different algorithm. It is not better than glibc per se: sometimes it performs much better than glibc, and sometimes it works pretty badly. It also gives finer control if the programmer calls its own API instead of just the standard malloc. So very few people use it system-wide by default; specific applications may decide to use it if their developers have benchmarked or profiled them and concluded that it is more effective. Luckily, I tried it with freebayes, and it works very well.

Here is the memory usage of 25 processes of freebayes using jemalloc:

Here is the memory usage of 25 processes of freebayes using glibc malloc:

This memory usage is the same as in job 38164 from Dr. Carpenter. This is what the memory looks like when running 20 freebayes processes for about 2 hours; this data is from Apr 27th.

I also get the same error and OOM kills from my glibc run:

Sat May 2 22:41:48 EDT 2020
Genotyping individuals of ploidy 2 using freebayes...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

[Sun May 3 00:00:56 2020] Out of memory: Kill process 10412 (freebayes) score 109 or sacrifice child
[Sun May 3 00:00:56 2020] Killed process 10412 (freebayes) total-vm:51300260kB, anon-rss:43296152kB, file-rss:144kB, shmem-rss:0kB
[Sun May 3 00:15:25 2020] Out of memory: Kill process 10401 (freebayes) score 80 or sacrifice child
[Sun May 3 00:15:25 2020] Killed process 10401 (freebayes) total-vm:36541940kB, anon-rss:31738116kB, file-rss:0kB, shmem-rss:0kB
[Sun May 3 00:23:13 2020] Out of memory: Kill process 10402 (freebayes) score 80 or sacrifice child
[Sun May 3 00:23:14 2020] Killed process 10402 (freebayes) total-vm:31620152kB, anon-rss:31564580kB, file-rss:4kB, shmem-rss:0kB
[Sun May 3 00:27:33 2020] Out of memory: Kill process 10419 (freebayes) score 81 or sacrifice child
[Sun May 3 00:27:33 2020] Killed process 10419 (freebayes) total-vm:34518700kB, anon-rss:32269844kB, file-rss:0kB, shmem-rss:0kB
[Sun May 3 00:33:06 2020] Out of memory: Kill process 10413 (freebayes) score 78 or sacrifice child
[Sun May 3 00:33:06 2020] Killed process 10413 (freebayes) total-vm:31922388kB, anon-rss:31026740kB, file-rss:0kB, shmem-rss:0kB

As you can see, with 25 freebayes processes using jemalloc the system seems to handle it quite easily. Almost half of the memory is still free, so I think running 40 freebayes processes is not a problem. I am running a test right now, but to be honest, I am quite certain about it.

So I will list the modifications needed to achieve this, but please don’t rush to edit your dDocentHPC.bash yet. There is more later:

  1. Add “export LD_PRELOAD=/opt/conda/lib/libjemalloc.so” (without the quotation marks) right after VERSION=4.3. This makes jemalloc replace glibc malloc for all commands invoked in dDocentHPC.bash.
  2. Add “--delay 300” to the “parallel” command that launches freebayes. This line is a little bit long:

            ls mapped.*.$CUTOFFS.bed | sed 's/mapped.//g' | sed 's/.bed//g' | shuf | parallel --delay 300 --no-notice -j $NUMProc "freebayes -b cat.$CUTOFFS-RRG.bam -t mapped.{}.bed -v raw.{}.vcf -f reference.$CUTOFFS.fasta -p $PLOIDY -n $BEST_N_ALLELES -m $MIN_MAPPING_QUAL -q $MIN_BASE_QUAL -E $HAPLOTYPE_LENGTH --min-repeat-entropy $MIN_REPEAT_ENTROPY --min-coverage $MIN_COVERAGE -F $MIN_ALT_FRACTION -C $FREEBAYES_C -G $FREEBAYES_G -3 $FREEBAYES_3 -e $FREEBAYES_e -z $FREEBAYES_z -Q $FREEBAYES_Q -U $FREEBAYES_U -$ $FREEBAYES_DOLLAR --populations popmap.$CUTOFFS ${FREEBAYES_r}${FREEBAYES_report_monomorphic}${FREEBAYES_w}${FREEBAYES_V}${FREEBAYES_a}${FREEBAYES_no_partial_observations}"

The “--delay 300” added here makes the freebayes processes launch 5 minutes apart from each other. Yes, it wastes some time for 40 processes, but compared with the overall running time this does not seem too much to ask. We can also adjust the time here; maybe 1 minute apart is good enough. The reason for adding this is that, although jemalloc pretty much fixed the memory issue, there are still a few memory spikes, and from my observation a couple of freebayes processes had spikes within a close time frame. I am not entirely sure why, but given that they run the same code on the same CPU with similar data, it is probably reasonable and nothing to worry about. The delay is just there to keep the spikes from happening at the same time; without simultaneous spikes, the system can handle 40 freebayes processes rather easily.
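As a rough way to weigh the delay against the run time, the sketch below computes how long it takes until the last process of the first batch has started. It assumes the delay is applied between consecutive launches, so the N-th job starts (N - 1) x delay seconds after the first. The helper names last_launch_offset and fmt_hms are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: start-up cost of staggering job launches with `parallel --delay`.
# last_launch_offset and fmt_hms are hypothetical helper names.

# Seconds between the first launch and the launch of job number $1,
# given a delay of $2 seconds between consecutive launches.
last_launch_offset() {
    local njobs=$1 delay=$2
    echo $(( (njobs - 1) * delay ))
}

# Format a number of seconds as XhYYmZZs.
fmt_hms() {
    local s=$1
    printf '%dh%02dm%02ds\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

fmt_hms "$(last_launch_offset 40 300)"   # 40 processes, 5 minutes apart
fmt_hms "$(last_launch_offset 40 60)"    # 40 processes, 1 minute apart
```

For 40 processes this gives 3h15m00s with a 300-second delay and 0h39m00s with a 60-second delay, which is the trade-off between start-up cost and spike separation described above.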

NOW, regarding further reducing the dDocent run time: I find that most of the time is spent in freebayes. All processes prior to that complete rather fast and do not utilize multiple cores much; samtools uses about 4 cores even when given 25 cores as input, and bedtools seems to run on a single core only. Freebayes also runs on a single core only, and it does not need to communicate with any other process. This opens an opportunity for us: I can modify dDocentHPC.bash to split it into 3 stages. During the freebayes stage, I will have dDocent launch freebayes on other nodes through my cluster scheduler. So I can give you a dDocent with more than just 40 processes; we can try 80, 120 …. You get my point.

This plan is not without drawbacks: you would have to use my modified version of dDocentHPC.bash, and I would have to modify it again whenever you want to use a newer version. It’s probably not going to be too much work after the first time, but it is some additional work. So my question is: how badly do you want this to run faster? Or is 40 cores good enough for you?

If 40 cores is good enough for you, I will give you a script with the previous 2 changes already made and, as you can see, it’s relatively easy to make these changes yourself in the future.

If you do want it to run even faster, then please provide me with a smaller dataset on which I can run all stages, preferably in minutes, not days. I need the data for testing, to make sure I modified the script correctly, and in this case I need data for all stages, so a smaller dataset would be much nicer.

Please let me know your thoughts so I can prepare for the next step.

Best, Min Dong

From: Carpenter, Kent E. kcarpent@odu.edu Sent: Friday, May 1, 2020 7:03 AM To: Bird, Chris chris.bird@tamucc.edu; Garcia, Eric e1garcia@odu.edu; Dong, Min mdong@odu.edu Cc: hpc hpc@odu.edu Subject: Re: salloc problem on 38164: COMPLETED

The run with the bad_alloc problem finally finished this morning. The output files look credible and there are no final error messages in the SLURM out file. I will take a look at these files later today to see how many contigs were produced. It will be interesting to see how these vcf files look compared to identical runs that had the number of threads reduced to prevent the bad_alloc problem.

Kent E. Carpenter Professor & Eminent Scholar Department of Biological Sciences, PSB 3120A Old Dominion University Norfolk, Virginia 23529-0266 USA & Manager, IUCN Global Marine Species Assessment/ IUCN Species Programme Marine Biodiversity Unit:https://sites.wp.odu.edu/GMSA/ Office Phone: (757) 683-4197


From: Bird, Chris Chris.Bird@tamucc.edu Sent: Thursday, April 30, 2020 2:35 PM To: Garcia, Eric e1garcia@odu.edu; Dong, Min mdong@odu.edu; Carpenter, Kent E. kcarpent@odu.edu Cc: hpc hpc@odu.edu Subject: RE: salloc problem on 38164

I believe that the ram is being taken up by loading the catbam file on each thread. Threads x catbamGB < physical ram GB

From: Garcia, Eric e1garcia@odu.edu Sent: Thursday, April 30, 2020 1:27 PM To: Dong, Min mdong@odu.edu; Kent Carpenter kcarpent@odu.edu; Bird, Chris Chris.Bird@tamucc.edu Cc: hpc hpc@odu.edu Subject: Re: salloc problem on 38164

Hey Min,

Here is the actual data causing the problem.

You only need to copy these files into a new directory:
cp /home/kcarpent/PIRE_data/Ssp_Cap/less2_mkVCF/-RG.bam
cp /home/kcarpent/PIRE_data/Ssp_Cap/less2_mkVCF/config.4.all.cbirdq
cp /home/kcarpent/PIRE_data/Ssp_Cap/less2_mkVCF/reference.2.2.fasta
cp /home/kcarpent/PIRE_data/Ssp_Cap/less2_mkVCF/dD

then execute:
sbatch dDocentHPC.sbatch

Eric


From: Dong, Min mdong@odu.edu Sent: Thursday, April 30, 2020 1:40 PM To: Carpenter, Kent E. kcarpent@odu.edu; Garcia, Eric e1garcia@odu.edu; Bird, Chris chris.bird@tamucc.edu Cc: hpc hpc@odu.edu Subject: RE: salloc problem on 38164

Hmm, to find that out, I will need to modify your dDocentHPC.bash script, but all of you have many copies of the script in many of your directories. Can you give me a directory where I can make the modification?

Also, could you please give me some test data so I can assist in troubleshooting the problem? The data that is causing our problem right now would be best. I will make a copy of it so I don’t overwrite your data.

From: Carpenter, Kent E. kcarpent@odu.edu Sent: Thursday, April 30, 2020 12:33 PM To: Dong, Min mdong@odu.edu; Garcia, Eric e1garcia@odu.edu; Bird, Chris chris.bird@tamucc.edu Cc: hpc hpc@odu.edu Subject: Re: salloc problem on 38164

Thanks Min, This is all good to know. How can we also check to see if a process was killed? This does not appear in our out file. I guess we will let it keep going for a while longer since the memory usage has changed and it is still writing. Thanks, Kent



From: Dong, Min mdong@odu.edu Sent: Thursday, April 30, 2020 12:24 PM To: Carpenter, Kent E. kcarpent@odu.edu; Garcia, Eric e1garcia@odu.edu; Bird, Chris chris.bird@tamucc.edu Cc: hpc hpc@odu.edu Subject: RE: salloc problem on 38164

Hi Dr. Carpenter,

For your question 2, I store the records for 15 days; data older than that gets deleted day by day. I do this because this site is meant more for troubleshooting. There are more permanent records that you can find at http://xdmod.hpc.odu.edu/

For your question 1, I think the causation is more likely reversed here. It’s more likely that there is a problem in either the input or freebayes itself that caused both the memory issue and the huge output. There is actually no direct connection between memory usage and input/output size. Not every program needs to read all of its data in at once, and the same applies to output. Of course, this depends heavily on the algorithm in the code; certain computations just need to have all the data in memory. That said, the same program should normally take about the same amount of memory if the input size is about the same. So if one freebayes process consumes significantly more memory than the rest, it is more likely to be an issue.

Regarding job completion, have you launched just 1 freebayes? There is a chance that the problematic freebayes process is dead but the rest keep going, which would make it appear that the job can finish. As far as I can tell, on your job 38164 out-of-memory kills actually occurred multiple times:

[Tue Apr 28 05:57:14 2020] Killed process 11979 (freebayes) total-vm:31995976kB, anon-rss:28824652kB, file-rss:120kB, shmem-rss:0kB
[Tue Apr 28 07:17:28 2020] Killed process 11974 (freebayes) total-vm:29057920kB, anon-rss:27897504kB, file-rss:0kB, shmem-rss:0kB
[Tue Apr 28 09:07:58 2020] Killed process 11969 (freebayes) total-vm:27072128kB, anon-rss:26995712kB, file-rss:0kB, shmem-rss:0kB
[Tue Apr 28 09:55:25 2020] Killed process 11953 (freebayes) total-vm:28968300kB, anon-rss:26616164kB, file-rss:0kB, shmem-rss:0kB
[Tue Apr 28 10:08:03 2020] Killed process 11954 (freebayes) total-vm:36468472kB, anon-rss:31890008kB, file-rss:0kB, shmem-rss:0kB
[Tue Apr 28 11:07:11 2020] Killed process 11964 (freebayes) total-vm:44220784kB, anon-rss:41518752kB, file-rss:12kB, shmem-rss:0kB
[Tue Apr 28 18:03:48 2020] Killed process 11982 (freebayes) total-vm:55519208kB, anon-rss:47863484kB, file-rss:0kB, shmem-rss:0kB
[Thu Apr 30 00:26:41 2020] Killed process 11976 (freebayes) total-vm:51641428kB, anon-rss:47280844kB, file-rss:0kB, shmem-rss:0kB

My guess is that it did not complete, and the job output is probably not complete either.
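For anyone who wants to check for kills like these themselves, here is a small bash sketch that condenses kernel "Killed process" records into one short line per kill. It assumes the record format quoted in this thread. The function name summarize_oom_kills is made up, and on a live node the input would typically come from something like dmesg -T, which may require elevated privileges.

```shell
#!/usr/bin/env bash
# Sketch only: condense "Killed process" kernel log records.
# summarize_oom_kills is a hypothetical helper name.
# Reads log text on stdin and prints "<pid> <command> <anon-rss in whole GiB>".
summarize_oom_kills() {
    sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2 \3/p' |
    while read -r pid cmd rss_kb; do
        # Integer division: kB -> MiB -> GiB, rounded down
        echo "$pid $cmd $(( rss_kb / 1024 / 1024 ))"
    done
}

# Two of the records quoted in this thread, used as sample input
printf '%s\n' \
  '[Tue Apr 28 05:57:14 2020] Killed process 11979 (freebayes) total-vm:31995976kB, anon-rss:28824652kB, file-rss:120kB, shmem-rss:0kB' \
  '[Tue Apr 28 18:03:48 2020] Killed process 11982 (freebayes) total-vm:55519208kB, anon-rss:47863484kB, file-rss:0kB, shmem-rss:0kB' |
  summarize_oom_kills
```

On the two sample records this prints 11979 freebayes 27 and 11982 freebayes 45, meaning single freebayes processes held roughly 27 and 45 GiB of resident memory when they were killed.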

Best, Min Dong

From: Carpenter, Kent E. kcarpent@odu.edu Sent: Thursday, April 30, 2020 11:43 AM To: Garcia, Eric e1garcia@odu.edu; Bird, Chris chris.bird@tamucc.edu; Dong, Min mdong@odu.edu Cc: hpc hpc@odu.edu Subject: Re: salloc problem on 38164

Hi Min and Chris, Eric submitted other iterations of the job, and the job that had the bad_alloc problem is actually still running and writing to files. The files it is writing to are very large, larger than the files that were run on the other cores before the error message. So the guess is that the larger file manipulation caused the memory problem. Below is the memory usage for bad_alloc job 38164 (thanks Min for showing us how to use this!! Really cool!!). As you can see, the upper line leveled out. The bad_alloc was written to the out file at 12:30 on 28 April. We have only 2 quick questions:
1) If this bad_alloc job does end up completing, is there any reason to suspect that the bad_alloc error caused the results to be problematic?
2) Does the record of memory usage get deleted, or is there a way to go back to the overwatch website to look at memory usage on completed jobs?
Many thanks again! Kent



From: Garcia, Eric e1garcia@odu.edu Sent: Wednesday, April 29, 2020 6:39 PM To: Carpenter, Kent E. kcarpent@odu.edu; Bird, Chris chris.bird@tamucc.edu; Dong, Min mdong@odu.edu Cc: hpc hpc@odu.edu Subject: Re: salloc problem on 38164

I went ahead and ran a new job with fewer threads.

1.- Starting with a fresh mkVCF directory, I checked my free memory and divided that by the size of the cat*.bam file. From previous runs, we know that file was 29G for this dataset. I rounded this number up to 30G, since it is very likely not exactly 29G. My free memory was 320G (free -g), so I rounded down to 300G. Thus 300G/30G = 10 processors.

Thus I ran with 10 proc and 300G in /home/e1garcia/PIRE_data/Ssp_Cap/mkVCF_less1

sbatch dDocentHPC.sbatch
Submitted batch job 38834

I am expecting this to take a longggg time, so just in case, I set a ridiculous #SBATCH --time=144:00:00

2.- I am currently copying the .bam and .bam.bai files into Kent's new directory "less2b_mkVCF". Once the copying is done, I will check the free memory on a new node and divide that by 30 again to set the # of proc. and run a new job.

Let me know if you think I should have done something else and I can cancel and rerun.

cheers, Eric


From: Garcia, Eric e1garcia@odu.edu Sent: Wednesday, April 29, 2020 4:50 PM To: Carpenter, Kent E. kcarpent@odu.edu; Bird, Chris chris.bird@tamucc.edu; Dong, Min mdong@odu.edu Cc: hpc hpc@odu.edu Subject: Re: salloc problem on 38164

Looking good! Thanks Min, Chris

Kent, want to meet up in zoom? E


From: Carpenter, Kent E. kcarpent@odu.edu Sent: Wednesday, April 29, 2020 2:07 PM To: Bird, Chris chris.bird@tamucc.edu; Dong, Min mdong@odu.edu; Garcia, Eric e1garcia@odu.edu Cc: hpc hpc@odu.edu Subject: Re: salloc problem on 38164

Thanks Min! These detailed answers are very interesting, informative and useful. Ok, I will get together with Eric and run these again as Chris suggests while monitoring the memory as you outlined and let you know what happens. Thanks, Kent



From: Bird, Chris Chris.Bird@tamucc.edu Sent: Wednesday, April 29, 2020 2:00 PM To: Dong, Min mdong@odu.edu; Carpenter, Kent E. kcarpent@odu.edu; Garcia, Eric e1garcia@odu.edu Cc: hpc hpc@odu.edu Subject: RE: salloc problem on 38164

So, it sounds like it ran out of memory. When this is run again, decrease the number of threads

threads = rounddown(maxRamGB / catBamGB)

See if that works and use the method described by Min to see what happens with the ram as it runs.

On another node, I would try starting fresh with the bam and bam.bai files, but I really don’t think that was the problem.
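The rule of thumb above, threads = rounddown(maxRamGB / catBamGB), can be sketched in bash as follows. The function name max_threads is hypothetical, and the lower bound of 1 is an added assumption so that the result is always usable as a process count. The inputs would come from something like free -g and the size of the cat*.bam file.

```shell
#!/usr/bin/env bash
# Sketch only: threads = rounddown(maxRamGB / catBamGB).
# max_threads is a hypothetical helper name; the floor of 1 is an assumption.
max_threads() {
    local max_ram_gb=$1 cat_bam_gb=$2
    local n=$(( max_ram_gb / cat_bam_gb ))   # bash integer division rounds down
    (( n < 1 )) && n=1
    echo "$n"
}

# 300 GB of usable RAM and a 30 GB cat*.bam file
max_threads 300 30
```

With 300 GB of usable RAM and a 30 GB cat*.bam file this prints 10, matching the 300G/30G = 10 processors used for job 38834.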

From: Dong, Min mdong@odu.edu Sent: Wednesday, April 29, 2020 12:54 PM To: Kent Carpenter kcarpent@odu.edu; Bird, Chris Chris.Bird@tamucc.edu; Garcia, Eric e1garcia@odu.edu Cc: hpc hpc@odu.edu Subject: RE: salloc problem on 38164

Hi Dr. Carpenter,

  1. This is going to be difficult; there are a couple of issues here:
     a. If the bad_alloc application quit, then the memory will have been released, and you won’t see it any more.
     b. If the bad_alloc application did quit but other parts of dDocent didn’t, then we might be able to add a “free -g” or “ps -f -u $USER” command to your dDocent script to show its output. I am still studying the source code of dDocent; this is a possibility, but I am not yet sure how to actually do it.
     c. If the bad_alloc application didn’t quit, then getting this information from the job script would be harder. I can create a wrapper so that dDocent calls the wrapper and the wrapper calls freebayes; the wrapper monitors the freebayes output and checks the memory when bad_alloc shows up.

     But there is also another way to deal with this issue. It would not be automatic, but it would be easier to use. Here is what you do:
     a. First, check which node is running your job with the command: sacct -j JOB_ID -o nodelist -p
        example: sacct -j 38164 -o nodelist -p

        NodeList|
        d6-w6420b-03|
     b. Now, go to the website: https://overwatch.wahab.hpc.odu.edu/ This is a real-time resource usage monitoring system I set up to debug issues like this.
     c. Click on Node list in the bottom right corner.
     d. Select d6 as the rack at the top left.
     e. Click d6-w6420b-03.
     f. Click on Memory Usage and click view to enlarge the graph.
     g. Click Time range in the top right corner and change it from Last 6 hours to, let’s say, the last 2 days.

     It may seem like a lot of steps, but it’s actually pretty simple to use; please give it a try. This is not going to be as convenient as if I added some way to display memory usage in the dDocent script, but it should be easier to understand.

  2. The job has not timed out yet. The other part of the job is still running, and it actually took all the memory of the system; you cannot even ssh into the system, because there is no memory left to create a new sshd process. But it is still running, and it will die when the time limit is reached.

  3. It’s not necessary.
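The wrapper idea from point 1c could look roughly like the bash sketch below. This is an illustration under assumptions, not the wrapper Min describes: the function name run_with_bad_alloc_watch and the log file argument are made up, and it simply records date and free -g output whenever a stderr line mentions bad_alloc.

```shell
#!/usr/bin/env bash
# Sketch only: run a command and snapshot memory when its stderr mentions
# bad_alloc. run_with_bad_alloc_watch is a hypothetical helper name.
# Usage: run_with_bad_alloc_watch LOGFILE command [args...]
run_with_bad_alloc_watch() {
    local log=$1; shift
    {
        # Send the command's stderr into the pipe and keep its stdout on fd 3,
        # which is the caller's original stdout.
        "$@" 2>&1 1>&3 3>&- | while IFS= read -r line; do
            printf '%s\n' "$line" >&2            # pass the message through
            case $line in
                *bad_alloc*) { date; free -g; } >> "$log" 2>&1 ;;
            esac
        done
        return "${PIPESTATUS[0]}"                # exit status of the command
    } 3>&1
}

# Example (not executed here):
#   run_with_bad_alloc_watch mem_snapshot.log freebayes -b cat.bam ...
```

The pipe is used instead of a background monitor so that the function only returns after every stderr line has been processed. The snapshot is taken when the message is read, which may be slightly after the allocation actually failed.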

Best, Min Dong

From: Carpenter, Kent E. kcarpent@odu.edu Sent: Wednesday, April 29, 2020 1:18 PM To: Dong, Min mdong@odu.edu Cc: Garcia, Eric e1garcia@odu.edu; Bird, Chris chris.bird@tamucc.edu Subject: salloc problem on 38164

Hi Min, We are encountering another bad_alloc problem in our dDocent run after you helped us solve the first alloc problem, so this is new. The attached outfile at the end shows:

" Mon Apr 27 15:36:39 EDT 2020 Genotyping individuals of ploidy 2 using freebayes... terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc"

After that occurred, the program stopped writing VCF files, although it looks like the job is still active in my queue. We assume the bad_alloc problem is because of a memory problem, as you mentioned previously. Three specific questions that we are hoping you can help us with:
1) How do we track down how much memory was used when we had the alloc problem?
2) Was the job timed out when that bad_alloc problem occurred?
3) If we are running an sbatch on Wahab, should we also run salloc before running the program?

Thanks again for your help.

Kent


cbird808 commented 4 years ago

Hi Guys,

Sorry for the late update; I passed out early last night. The job has finished with no errors in the messages, and I got a 2.3 G final output. Can someone verify its correctness? I really have no idea. It should be readable by everybody:

/scratch/mdong003/kc/TotalRawSNPs.2.2.vcf.gz

The job detail is:

   JobID  JobName     Start                End                  Elapsed
   40183  mkVCF_Ssp+  2020-05-05T15:09:33  2020-05-06T01:37:26  10:27:53

I had to restart again to revert some changes I had added to the script for testing, so this run actually started about 3 hours after I sent my last email.

The final modification is:

srun --cpu-bind=none -n 1 -r \$(expr \$PARALLEL_SEQ % $SLURM_NNODES) crun env LD_PRELOAD=/opt/conda/lib/libjemalloc.so

new: --cpu-bind=none prevents Slurm from binding processes to specific cores. Normally binding would be a good thing, but Slurm was binding some processes to the same core.

You can find a modified version of the script here:

/scratch/mdong003/kc/dDocentHPC.bash

But in this script, I only changed the one freebayes line that is used in this computation; I did not modify the other conditions. Please let me know if there is somewhere else you want me to change.

The memory usage is below. There is still one spike that looks rather bad; it used almost all the memory, but I can confirm there was no OOM kill. For safety, I think you should maybe ask for 120 cores in sbatch but only set 105 or 108 in config.4.cbirdq.

The job script header I used is:

#SBATCH -n 120

#SBATCH --ntasks-per-node 40

From the graph you can see that most of the freebayes jobs actually finished in around 5 to 7 hours; the remaining 3 to 4 hours were spent entirely on vcfcombine and bgzip. I did not modify the script here, so bgzip may be using 120 threads instead of 40 threads, which probably made things worse.

Next step:

  1. I think it’s best that I write a small MPI wrapper to basically replace the functionality of GNU parallel. The benefits of doing so are:
     a. Most importantly, it’s portable. A simple MPI program that only launches freebayes accordingly should have no dependency other than MPI, can be built easily on any cluster, and can be run with mpirun/mpiexec/srun everywhere, so all schedulers are supported.
     b. Less importantly, launching an MPI program on Slurm or any other scheduler should use the correct node binding and CPU binding, which removes the need for the -r, -n, and --cpu-bind parts of the command and makes it easier to type and modify.
  2. I will look into the last few commands that ran for 4 hours and see which part took so long. Maybe I can run vcfcombine on multiple nodes in a binary-merge style, or maybe it’s just bgzip that takes so long, in which case there is not much I can actually do. I need to look into this.

Unfortunately, I may have to postpone the next step until next week. I have a big downtime coming up on May 10th, and I am actually a little bit stressed out.

Best, Min Dong