Change the way the EQRM queues jobs on the NCI - submit jobs in checkpoint/restart chunks

The EQRM is currently not optimised to take advantage of the queuing system on 
the NCI. The NCI have suggested that we look at modifying the EQRM so that it 
submits jobs in smaller chunks - see below for e-mail communication between 
David Robinson and the NCI dated 28 August 2012. 
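
As a starting point for discussion, a minimal sketch of the checkpoint/restart idea is shown here. It is not the EQRM's actual event loop: `run_chunk`, `checkpoint_path`, `process_event` and `time_budget_s` are hypothetical names, and the sketch assumes the simulation can be split into independent, numbered events whose progress can be recorded as a single index in a JSON file.

```python
import json
import os
import time


def run_chunk(checkpoint_path, total_events, process_event,
              time_budget_s, clock=time.monotonic):
    """Process events until all are done or the time budget is spent.

    Returns True when every event has been processed, False when the
    job stopped early and should be resubmitted. All names here are
    illustrative assumptions, not part of the EQRM.
    """
    # Resume from the last checkpoint if a previous job left one behind.
    state = {"next_event": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    start = clock()
    while state["next_event"] < total_events:
        # Stop before starting another event once the budget is used up.
        if clock() - start >= time_budget_s:
            break
        process_event(state["next_event"])
        state["next_event"] += 1

        # Write the checkpoint to a temporary file and rename it, so a
        # node failure mid-write cannot leave a corrupt checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["next_event"] >= total_events
```

A batch job built on this would call `run_chunk` with a time budget safely below its walltime limit and, if it returns False, resubmit itself so the next job picks up from the checkpoint.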

==================================================

Thanks for your reply, Margaret. 

You have made a strong case for us to address this from our end using 
checkpoint/restart. I have put this request to our software engineers and they 
will be looking at it over the coming days. 

Would it also be possible to get a small increase as follows:
Walltime: 6 hours => 24 hours
NCPUS: 64 => 128

Also, I have a more general question about the types of memory you put in the 
jobfile. What is the difference between vmem and jobfs, and do they scale with 
the number of ncpus (i.e. if we increase the number of ncpus, do we get more 
vmem and jobfs)? 

Regards
David

-----Original Message-----
From: Apache [mailto:apache@anusf.anu.edu.au] On Behalf Of Margaret Kahn for 
help
Sent: Tuesday, 28 August 2012 1:21 PM
To: Robinson David
Subject: [nf.nci.org.au #33149] Can I increase my walltime for a vayu 
simulation? [DLM=For-Official-Use-Only] 

David,

Before doing this we always ask if it is possible to run your job in checkpoint/restart mode as a sequence of short jobs, rather than one long job.

There are lots of reasons why you should choose to run short jobs instead of 
long jobs and, in particular, why you should break up really long jobs into 
small checkpoint/restart chunks if you can. 

  * you get better queue utilization: long jobs will generally queue longer and be suspended longer

  * you will protect yourself against system and node failures. It can be very frustrating waiting patiently for a week-long job only to see it fail in the last day or hour.  We do not reimburse this lost time.

  * you will be able to run jobs in smaller time windows leading up to system or node downtimes.  If we have fortnightly system downtimes and you want to run week-long jobs, then you will only be able to get jobs started half the time.

The best way to use checkpoint/restart is in automated self-resubmitting jobs - 
you submit the first job and the next 20 jobs just take care of themselves.  
There are some template batch scripts in /apps/examples/scripts/ that might 
help you get started. Contact help if you want assistance.

Note that for maintenance reasons, we need to bring nodes or the whole system 
down from time-to-time.  Long jobs can make scheduling this difficult.  As a 
result, on occasions, long jobs may be ignored (i.e. effectively terminated) 
when deciding on downtimes. 

If this is not possible we can increase your walltime limit,

 Margaret

> [david.robinson@ga.gov.au - Tue Aug 28 13:14:46 2012]:
> 
> Hi
> 
> My name is David Robinson (djr547). I am trying to run an earthquake
>    hazard simulation for a small part of Indonesia under project w84 (see
>    /short/w84/sandpits/drobinson/test50). I tried running it on 64
>    processors with a walltime of 3 hours and it managed only 10% of
>    the job.
> My best estimate of the required walltime for this job is 30 hours but
>    I currently have a limit of 6 hours. See message:
> qsub: walltime request (72:00:00) exceeds limit (06:00:00) for 64
>    cpu(s)
> 
> 
> Is it possible to increase my walltime limit to 100 hours? This
>    simulation is much smaller than some of the others we plan to run
>    over the coming months.
> 
> Also, I noted that I got a similar message when I asked for more than
>    180Gb of vmem. This may be enough but for future reference is it
>    possible to increase this if needed?
> 
> Thanks in advance
> David
> 
> 
--
Dr Margaret Kahn,
Academic consultant,
ANU Supercomputer Facility,
NCI National Facility,
Leonard Huxley Building,
The Australian National University,
Canberra ACT 0200 Australia

Telephone  : +61 2  6125 4541
Fax               : +61 2  6125 8199
E-mail          : Margaret.Kahn@anu.edu.au
WWW           : http://nf.nci.org.au
Original issue reported on code.google.com by RobinsonDavidJ0@gmail.com on 29 Aug 2012 at 5:55