abhilekhsingh / gc3pie

Automatically exported from code.google.com/p/gc3pie

EC2 backend: slowness when the number of submitted jobs is large #451

GoogleCodeExporter commented 9 years ago
Hello,

I want to report something related to execution on the EC2 backend.
The application I use is a map/reduce-based workflow whose jobs are
'embarrassingly parallel': I submit x jobs over N virtual machines
and retrieve the output file of each job when it finishes.
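
For clarity, my script has roughly this shape (a minimal sketch, not my
actual code; the executable and file names are placeholders):

import gc3libs
from gc3libs.cmdline import SessionBasedScript

class MapReduceScript(SessionBasedScript):
    """Submit independent jobs; fetch each job's output when it is done."""
    def __init__(self):
        super(MapReduceScript, self).__init__(version='1.0')

    def new_tasks(self, extra):
        # 200 embarrassingly parallel jobs; the command is a placeholder
        for i in range(200):
            extra_args = extra.copy()
            extra_args['output_dir'] = 'job.%d' % i
            yield gc3libs.Application(
                arguments=['/usr/bin/my-map-task', str(i)],
                inputs=[],
                outputs=['result.%d.txt' % i],
                stdout='stdout.txt',
                **extra_args)

if __name__ == '__main__':
    MapReduceScript().run()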

Using the EC2 backend of GC3Pie, I noticed a slowdown when the number of
submitted jobs is large (I tried 200 jobs). I tested my SessionBasedScript
code with 200 jobs over:
- 20 VMs (m1.small) with a total of 20 cores
- 1 VM (xlarge) with 32 virtual cores
- 1 VM with 8 cores

Each job should last between 2 and 5 minutes. However, the script keeps
looping over the queued/remaining jobs and does not detect that some of the
submitted jobs have already finished. Take PID 1401 as an example: it is
reported as not finished (see the attached log file), but when I logged into
the VM manually I verified that the process had already finished (using:
ps -eaf | grep 1401) and that its corresponding job output file is there.
After 5 hours, only 100 jobs had been downloaded to my machine.

It seems to me that the script detects a finished job only when it loops
over the remaining jobs, which takes a while. Or it could be that the
script cannot detect some of the finished jobs at all.

If the number of jobs is low (for example, 10 jobs), there is no problem
and all my job output files are downloaded.

Attached is an excerpt of the log file.

Regards,
Mohamed

Original issue reported on code.google.com by mohamed....@gmail.com on 28 Jul 2014 at 9:05

GoogleCodeExporter commented 9 years ago
Dear Mohamed,

yes, this is indeed a known problem: as you point out, the GC3Pie
Engine checks job status one by one in sequential fashion, which
is deadly with long lists of jobs.
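
To illustrate where the time goes: a typical driver loop looks roughly
like this (a minimal sketch; `apps` stands for whatever list of
Application tasks the script built). Every `progress()` call polls each
non-terminated job in turn:

import time
import gc3libs
from gc3libs import Run

engine = gc3libs.create_engine()
for app in apps:          # `apps` = your list of Application tasks
    engine.add(app)

while any(app.execution.state != Run.State.TERMINATED for app in apps):
    engine.progress()     # sequential: one status check per job
    time.sleep(10)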

We have already discussed the issue on the developers mailing list;
see also Issue 383.  There are two basic problems to be overcome:

1) Python is rather bad at performing operations concurrently within a
single process.  To be able to scale up to long lists of jobs, we would
need to move to a multi-process architecture.  This is planned for GC3Pie
version 3, but I'm afraid work in that direction will not start soon.

2) The EC2 and OpenStack backends are also quite slow because they
function as a pool of remote SSH-connected hosts, which means that
every operation (like checking the state of a job) will open and then
close an SSH connection.  This already becomes intolerably slow at
around a hundred jobs; other backends (e.g., the batch-system ones) do
not have this overhead and can easily scale to thousands of jobs.
Luckily, this is easier to solve, using some clever caching and SSH
connection pooling.
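
Just to illustrate what "SSH connection pooling" would mean here (an
illustrative sketch, not GC3Pie code; all names are made up):

import paramiko

_pool = {}

def get_connection(host, username, keyfile):
    """Return a cached SSH connection to `host`, opening one on first use."""
    if host not in _pool:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=username, key_filename=keyfile)
        _pool[host] = client
    return _pool[host]

def check_job(host, username, keyfile, pid):
    """Poll one job over the pooled connection; no per-call SSH handshake."""
    conn = get_connection(host, username, keyfile)
    _, stdout, _ = conn.exec_command('ps -p %d -o state=' % pid)
    return stdout.read().strip()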

Unfortunately, even fixing 2) requires some work on the GC3Pie core,
which in turn requires dedicated time and effort that we cannot devote
right now.

What's the plan for your GC3Pie experiments?  Do you have any
deadlines or milestones that you need to meet by a certain date?

Thanks,
Riccardo

Original comment by riccardo.murri@gmail.com on 28 Jul 2014 at 9:20

GoogleCodeExporter commented 9 years ago
Dear Riccardo,

I am comparing GC3Pie with a programming model that I have developed in my
thesis, and I need to produce some experimental results before the end of
July. I understand that these two basic problems require time and effort.

Suppose I am using 1 VM (with 32 cores) to run a bag of jobs. Is the
slowness in this case related to the SSH connections (point 2) or to
process concurrency in Python (point 1)?

If it is related to the SSH connections, I could set aside some time to
see whether I can add a temporary SSH-caching solution.

Thanks!
Mohamed

Original comment by mohamed....@gmail.com on 28 Jul 2014 at 12:16

GoogleCodeExporter commented 9 years ago
Hello,

An alternative solution is to set up an SGE cluster on AWS (using the
StarCluster package: http://star.mit.edu/cluster/) composed of 20 VMs
(one core each), and then use the SGE backend with the 200 jobs.
Might this resolve my issue for the moment?
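
Concretely, I have something like this in mind (a rough sketch; the
cluster name is just a placeholder for whatever template I define in the
StarCluster config):

starcluster start -s 20 mycluster    # 20 single-core nodes; StarCluster sets up SGE by default
starcluster sshmaster mycluster      # log in to the front-end, check the queue with qhost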

Thanks.
Mohamed

Original comment by mohamed....@gmail.com on 28 Jul 2014 at 12:35

GoogleCodeExporter commented 9 years ago
Sorry for all these questions.

I managed to set up an SGE cluster on AWS, and I can ssh to the front-end
node using a keypair:
ssh -i ~/.ssh/aws-cluster.rsa sgeadmin@ec2-23-23-33-153.compute-1.amazonaws.com

Unfortunately, my script says that it cannot connect to the front-end node:
TransportError: Failed while connecting to remote host
'ec2-23-23-33-153.compute-1.amazonaws.com': 'int' object is not iterable.
The log file is attached.

My config file is:
[auth/sshgordias]
type=ssh
username=sgeadmin
keypair_name=aws-cluster.rsa

[resource/gordias]
enabled = true
type = sge
auth = sshgordias
transport = ssh
frontend = ec2-23-23-33-153.compute-1.amazonaws.com
architecture = x86_64
max_cores = 20
max_cores_per_job = 1
max_memory_per_core = 1
max_walltime = 8
qstat = /usr/local/bin/qstat
qacct = /usr/local/sbin/qacct.sh
qdel = /usr/local/bin/qdel
keypair_name=aws-cluster.rsa
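
One guess on my part (untested): perhaps the ssh-type auth section expects
a keyfile path to the private key rather than an EC2 keypair_name, which
may belong only in EC2-type resource sections. Something like:

[auth/sshgordias]
type = ssh
username = sgeadmin
# guess: point directly at the private key file instead of keypair_name
keyfile = ~/.ssh/aws-cluster.rsa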

Original comment by mohamed....@gmail.com on 28 Jul 2014 at 2:14

GoogleCodeExporter commented 9 years ago
Dear Mohamed,

> Suppose I am using 1 VM (with 32 cores) to run a bag of jobs. Is the
> slowness in this case related to the SSH connections (point 2) or to
> process concurrency in Python (point 1)?

In this case it's mainly point 2.

As you suggest in another comment, using a virtual cluster would quite
likely work better (and may I pitch our own alternative to StarCluster?
It's called ElastiCluster; see http://gc3-uzh-ch.github.io/elasticluster).

Thanks,
Riccardo

Original comment by riccardo.murri@gmail.com on 28 Jul 2014 at 2:47

GoogleCodeExporter commented 9 years ago
Dear Mohamed,

> Unfortunately, my script says that it cannot connect to the front-end
> node: TransportError: Failed while connecting to remote host
> 'ec2-23-23-33-153.compute-1.amazonaws.com': 'int' object is not
> iterable. The log file is attached.

Can you please open a different issue to discuss this?  It seems quite
unrelated to the scalability issues of GC3Pie.

Thanks,
Riccardo

Original comment by riccardo.murri@gmail.com on 28 Jul 2014 at 2:50
