GeoscienceAustralia / eqrm

Automatically exported from code.google.com/p/eqrm

Reconsider how EQRM jobs in parallel handle I/O #78

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Recent feedback from the NCI indicates that they are not happy with the way the EQRM handles I/O when run on their systems.

We need to improve how the EQRM does this so that we use the NCI (and other clusters) as efficiently as possible. See below for a summary of the e-mail communication:

===========================================================
Message from David Robinson to Duncan Gray: 29 August 2012
===========================================================

Hi Duncan 

Thanks for the chat this morning

I have read the e-mails you sent yesterday regarding the fibre cable issue, and I have reviewed the feedback I received from the NCI last night. Here is my summary:

My job yesterday caused alarms at the NCI because of the time it was spending in I/O. Based on our discussion this morning, we agreed that I should review the feedback we received from the NCI in April 2012 to assess whether moving my job to the vayu nodes with fibre cabling (fc) would overcome this issue. This is what I learnt from that feedback:

1) There are 32 nodes with fc and 3GB memory per core, and 24 nodes with fc and 6GB memory per core. Together these represent only 56 of the 1488 nodes on vayu (about 3.8%).

2) The fc nodes are in high demand, so if we request them for large jobs we are likely to get stuck in the queue.

3) The NCI recommend that we do not seek fc nodes, but rather avoid requesting global jobfs (which is shared with /short).

See below for an extract of an e-mail from the NCI received in April 2012 (EXTRACT 1).

Based on the feedback received last night, the NCI have confirmed that they are unhappy with the EQRM's I/O to /short (see EXTRACT 2).

My conclusion from reading all this feedback is that we need to reconsider the way that the EQRM handles I/O. It is not simply a matter of moving our jobs to the nodes with fc. I will raise a ticket for this on Google Code and a separate ticket for splitting large jobs into smaller jobs. I don't believe that I am in a position to resolve these software issues, so I will be looking to you for leadership on getting the EQRM up and running on the NCI in a way that is efficient and allows us to do the large simulations for projects here in Australia and overseas.

I’m happy to keep discussing if you have other thoughts. 

David

=========================================

EXTRACT 1: e-mail from the NCI

Nodes          Jobs      Memory     Local jobfs   Global
                         per core   per node      jobfs

v[1-1200]      parallel  3GB        7GB           large
v[1201-1408]   any       3GB        10GB          none
v[1409-1440]   any       3GB        fc (large)    none
v[1441-1464]   any       6GB        10GB          none
v[1465-1488]   any       6GB        fc (large)    none

The critical thing is to avoid global jobfs (shared with /short).

When you were requesting 3GB per core or less, you had a high chance of getting global jobfs. As soon as you request more than 3GB per core, you will never get global jobfs. But by requesting fc and too much jobfs, you are missing out on the possibility of running on v[1441-1464], which are *much* less in demand than v[1465-1488].

So it's not fc that you want; it's to avoid global jobfs.
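To make the advice above concrete, here is a small Python sketch (illustrative only; the queue name, resource names and values, and the run_eqrm.py entry point are assumptions based on the table above, not tested EQRM tooling) that writes a PBS submission script requesting more than 3GB per core and local jobfs only, with no fc request:

```python
# Illustrative helper: write a PBS script that asks for > 3GB per core,
# so the scheduler never places the job on nodes with global jobfs
# (shared with /short), and that does not request fc nodes.
def write_pbs_script(ncpus, gb_per_core=4, path="run_eqrm.pbs"):
    assert gb_per_core > 3, "must exceed 3GB per core to avoid global jobfs"
    lines = [
        "#!/bin/bash",
        "#PBS -q normal",                             # queue name assumed
        "#PBS -l ncpus=%d" % ncpus,
        "#PBS -l vmem=%dGB" % (ncpus * gb_per_core),  # > 3GB per core in total
        "#PBS -l jobfs=50GB",                         # local jobfs only
        "#PBS -l walltime=10:00:00",
        "",
        "mpirun python run_eqrm.py",                  # hypothetical entry point
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

write_pbs_script(ncpus=64)  # 64 cores at 4GB/core -> vmem=256GB
```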

=========================================

EXTRACT 2: e-mail from the NCI

Just to give you an idea of what's going on, the metadata servers for the vayu filesystems (particularly /short) usually have a load of around 1-3. Since your job started, the /short metadata server load has sat at around 40. After suspending your job, it's back to 1.

Basically, the scalability of your job is limited by using the filesystem for communication. It's probably worth noting that network (MPI) communication will be at least three orders of magnitude faster than files and will not be shared (and so will not become a bottleneck).

Original issue reported on code.google.com by RobinsonDavidJ0@gmail.com on 29 Aug 2012 at 5:47

GoogleCodeExporter commented 9 years ago
This was fixed a while back, mainly by reducing writing to log files and by no longer using .npy files.

Original comment by duncan.g...@gmail.com on 4 Feb 2013 at 5:52
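For context, a change of that flavour might look something like the following sketch (illustrative only, not the actual commit; the logger name is an assumption): only rank 0 writes routine progress to a log file, while the other ranks log warnings and errors to stderr instead of all writing to /short.

```python
# Illustrative sketch: restrict file logging to rank 0 so that N ranks
# do not all write log files to the shared filesystem.
import logging
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
logger = logging.getLogger("eqrm")  # logger name assumed for illustration

if rank == 0:
    handler = logging.FileHandler("eqrm_run.log")
    logger.setLevel(logging.INFO)       # rank 0 keeps routine progress
else:
    handler = logging.StreamHandler()   # other ranks: no log files at all
    logger.setLevel(logging.WARNING)    # and only warnings/errors

handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.info("event loss calculation started")  # written to disk by rank 0 only
```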
