Open GoogleCodeExporter opened 9 years ago
Dear Mohamed,
yes, this is indeed a known problem: as you point out, the GC3Pie
Engine will check job status one by one in a sequential fashion, which
is deadly with long lists of jobs.
We have already discussed the issue on the developers mailing list;
see also Issue 383. There are two basic problems to be overcome:
1) Python is rather bad at concurrently doing operations in a single
process. To be able to scale up to lists of many jobs, we would need
to move to a multi-process architecture. This is planned for GC3Pie
version 3, but I'm afraid work in that direction will not start soon.
2) The EC2 and OpenStack backends are quite slow also because they
function as a pool of remote SSH-connected hosts, which means that
every operation (like checking the state of a job) will open and then
close an SSH connection. This becomes intolerably slow already at
around a hundred jobs; other backends (e.g., batch system ones) do not
have this overhead and can easily scale to thousands of jobs.
Luckily, this is easier to solve, using some clever caching and SSH
connection pooling.
Unfortunately, even 2) requires some work on the GC3Pie core, which in
turn requires some dedicated time and effort, which we cannot devote
right now.
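To illustrate the connection pooling idea from point 2, here is a minimal sketch. It is not GC3Pie code: `ConnectionPool`, `FakeConnection`, and `check_all` are hypothetical names, and the fake connection only stands in for a real SSH client so that the example runs without a remote host. The point is that 100 status checks against one host open a single connection instead of 100.

```python
import threading
from collections import defaultdict


class ConnectionPool:
    """Keep idle connections per host so later operations can reuse them."""

    def __init__(self, connect):
        # `connect(host)` is a hypothetical factory returning an object
        # with a `close()` method (e.g. a wrapper around an SSH client).
        self._connect = connect
        self._idle = defaultdict(list)
        self._lock = threading.Lock()

    def acquire(self, host):
        with self._lock:
            if self._idle[host]:
                return self._idle[host].pop()
        # No idle connection available: pay the connection cost once.
        return self._connect(host)

    def release(self, host, conn):
        # Hand the connection back instead of closing it.
        with self._lock:
            self._idle[host].append(conn)

    def close_all(self):
        with self._lock:
            for conns in self._idle.values():
                for conn in conns:
                    conn.close()
            self._idle.clear()


class FakeConnection:
    """Stand-in for a real SSH connection; counts how many were opened."""

    opened = 0

    def __init__(self, host):
        FakeConnection.opened += 1
        self.host = host
        self.closed = False

    def run(self, command):
        return "ran %r on %s" % (command, self.host)

    def close(self):
        self.closed = True


def check_all(pool, host, job_ids):
    """Check every job over a pooled connection instead of reconnecting."""
    results = []
    for job_id in job_ids:
        conn = pool.acquire(host)
        try:
            results.append(conn.run("status %s" % job_id))
        finally:
            pool.release(host, conn)
    return results
```

A real implementation would also need to detect and replace connections that have gone stale, which this sketch leaves out.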
What's the plan for your GC3Pie experiments? Do you have any
deadlines or milestones that you need to meet by a certain date?
Thanks,
Riccardo
Original comment by riccardo.murri@gmail.com
on 28 Jul 2014 at 9:20
Original comment by riccardo.murri@gmail.com
on 28 Jul 2014 at 9:21
Dear Riccardo,
I am comparing GC3Pie with a programming model that I have developed in my
thesis, and I need to produce some experimental results before the end of July. I
understand that these two basic problems require effort and time.
Suppose I am using 1 VM (with 32 cores) to run a bag of jobs. Is the slowness
in this case related to the SSH connection (point 2) or to process
concurrency in Python (point 1)?
If it is related to the SSH connection, I could reserve some time to see if I
can add a temporary SSH-cache solution.
Thanks!
Mohamed
Original comment by mohamed....@gmail.com
on 28 Jul 2014 at 12:16
Hello,
An alternative solution would be to set up an SGE cluster on AWS (using the
StarCluster package: http://star.mit.edu/cluster/) composed of 20 VMs (each
VM has one core), and then use the SGE backend with 200 jobs.
Might this resolve my issue for the moment?
Thanks.
Mohamed
Original comment by mohamed....@gmail.com
on 28 Jul 2014 at 12:35
Sorry for all these questions.
I managed to set up an SGE cluster on AWS, and I can ssh to the front-end node
using a keypair:
ssh -i ~/.ssh/aws-cluster.rsa sgeadmin@ec2-23-23-33-153.compute-1.amazonaws.com
Unfortunately, my script reports that it cannot connect to the front-end node:
TransportError: Failed while connecting to remote host
'ec2-23-23-33-153.compute-1.amazonaws.com': 'int' object is not iterable. The
log file is attached.
My config file is:
[auth/sshgordias]
type=ssh
username=sgeadmin
keypair_name=aws-cluster.rsa
[resource/gordias]
enabled = true
type = sge
auth = sshgordias
transport = ssh
frontend = ec2-23-23-33-153.compute-1.amazonaws.com
architecture = x86_64
max_cores = 20
max_cores_per_job = 1
max_memory_per_core = 1
max_walltime = 8
qstat = /usr/local/bin/qstat
qacct = /usr/local/sbin/qacct.sh
qdel = /usr/local/bin/qdel
keypair_name=aws-cluster.rsa
Original comment by mohamed....@gmail.com
on 28 Jul 2014 at 2:14
Attachments:
Dear Mohamed,
> Suppose I am using 1 VM (with 32 cores) to run a bag of jobs. The
> slowness in this case is related to the SSH-connection (point 2) or
> to the process concurrency in python (point 1) ?
In this case it's mainly point 2.
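With a single VM, every status check goes to the same host, so in principle one round trip could serve all jobs. The following is a minimal sketch of that caching idea, not something GC3Pie provides: `StatusCache` and `fetch_all` are hypothetical names, `fetch_all` is assumed to return the state of every job in one remote call, and the state strings are only illustrative.

```python
import time


class StatusCache:
    """Serve per-job state lookups from one batched query, refreshed by age."""

    def __init__(self, fetch_all, ttl=10.0, clock=time.monotonic):
        # `fetch_all()` is a hypothetical callable that returns a dict
        # {job_id: state} using a single remote operation.
        self._fetch_all = fetch_all
        self._ttl = ttl
        self._clock = clock
        self._states = {}
        self._fetched_at = None

    def state(self, job_id):
        now = self._clock()
        # Refresh only when there is no data yet or the data is too old.
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._states = self._fetch_all()
            self._fetched_at = now
        return self._states.get(job_id, "UNKNOWN")
```

The trade-off is freshness: a job's reported state may lag behind reality by up to `ttl` seconds, which is usually acceptable for polling long-running jobs.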
As you suggest in another comment, using a virtual cluster (may I
pitch our own alternative to StarCluster? it's called ElastiCluster,
see http://gc3-uzh-ch.github.io/elasticluster ) would quite likely be better.
Thanks,
Riccardo
Original comment by riccardo.murri@gmail.com
on 28 Jul 2014 at 2:47
Dear Mohamed,
> Unfortunately, my script says that it can not connect to the front-end
> node: TransportError: Failed while connecting to remote host
> 'ec2-23-23-33-153.compute-1.amazonaws.com': 'int' object is not
> iterable. The log file is attached.
Can you please open a different issue to discuss this? It seems quite
unrelated to the scalability issues of GC3Pie.
Thanks,
Riccardo
Original comment by riccardo.murri@gmail.com
on 28 Jul 2014 at 2:50
Original comment by riccardo.murri@gmail.com
on 19 Aug 2014 at 8:47
Original issue reported on code.google.com by
mohamed....@gmail.com
on 28 Jul 2014 at 9:05
Attachments: