abhilekhsingh / gc3pie

Automatically exported from code.google.com/p/gc3pie

EC2 backend still attempts submission to VMs in ERROR or TERMINATED state #408

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
From comment on issue #405:

| gc3.gc3libs: ERROR: VM with id `i-0000583e` is in ERROR state. Terminating it!
| gc3.gc3libs: ERROR: Ignored error in submitting task
| 'GTSubControllApplication.464': ValueError: _make_resource: `remote_ip`

This shows that `gc3libs.backends.ec2` detects that a VM is in error
state during `get_resource_status()`, but then still attempts
submission to it when `submit_job()` is called.

I think the problem is in lines 778--783: the VM is terminated, but
the associated sub-resource is not removed from the list:

            elif vm.state == 'error':
                # The VM is in error state: exit.
                gc3libs.log.error(
                    "VM with id `%s` is in ERROR state."
                    " Terminating it!", vm.id)
                vm.terminate()
                self._vmpool.remove_vm(vm.id)

Similar behavior happens for VMs that are found to be in `terminated`
state.

I would modify the code to remove the associated sub-resource, but I
would like confirmation that this does not break the logic elsewhere
first.
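
For concreteness, here is a minimal sketch of the change I have in
mind; it assumes the sub-resources are kept in a dictionary
`self.subresources` keyed by VM id, which may not match the actual
attribute name in `gc3libs.backends.ec2`:

            elif vm.state == 'error':
                # The VM is in error state: terminate it and forget it.
                gc3libs.log.error(
                    "VM with id `%s` is in ERROR state."
                    " Terminating it!", vm.id)
                vm.terminate()
                self._vmpool.remove_vm(vm.id)
                # Proposed addition: also drop the associated sub-resource,
                # so that `submit_job()` no longer considers this VM.
                # (Assumes sub-resources live in a dict keyed by VM id;
                # the actual container may be named differently.)
                self.subresources.pop(vm.id, None)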

Original issue reported on code.google.com by riccardo.murri@gmail.com on 26 Jul 2013 at 10:27

GoogleCodeExporter commented 9 years ago
Upon closer analysis of the code, the task does not seem trivial.

The code in `EC2Lrms.get_resource_status()` creates the sub-resource
before the VM is in an OK state, and does so *knowingly*:

    def get_resource_status(self):
        self.updated = False
        # Since we create the resource *before* the VM is actually up
        # & running, it's possible that the `frontend` value of the
        # resources points to a non-existent hostname. Therefore, we
        # have to update them with valid public_ip, if they are
        # present.

        self._connect()
        ...

Indeed, the sub-resource is created regardless of the VM state; the
only exception is 'pending', which just skips to the next VM in the
list. See lines 781--805.

I think this is wrong, and no sub-resource should be created for VMs
in state 'error', 'terminated', 'shutting-down' or 'stopped'.
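
One possible shape for such a check, sketched here under the
assumption that `get_resource_status()` loops over the VM pool with
something like `self._vmpool.get_all_vms()` (the real loop and helper
names may differ):

        # Sketch of a guard inside `get_resource_status()`: skip VMs that
        # can never run jobs, instead of creating sub-resources for them.
        UNUSABLE_STATES = ('error', 'terminated', 'shutting-down', 'stopped')

        for vm in self._vmpool.get_all_vms():   # iteration helper is assumed
            if vm.state == 'pending':
                # not up yet: check again at the next poll
                continue
            if vm.state in UNUSABLE_STATES:
                gc3libs.log.warning(
                    "VM `%s` is in state `%s`:"
                    " not creating a sub-resource for it.", vm.id, vm.state)
                continue
            # ... existing code: create or update the sub-resource
            #     for this (running) VM ...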

If we do that, then there are two problems to solve:

1) The `_get_remote_resource` method creates a sub-resource
   *unconditionally*, and is called at the start of most job-related
   methods: `cancel_job`, `get_results`, `update_job_state`, `peek`,
   `free`.

   All of these methods assume that a job has an `ec2_instance_id`
   attribute.  This hints at other bugs (see the sketch after this
   list):

     - `ec2_instance_id` is created by `submit_job`, so jobs in NEW
       state would trigger an exception;

     - `ec2_instance_id` is *not* deleted by `free()`, so we could in
       principle trigger errors just by calling any job-related method
       on a TERMINATED job.

2) If the VM goes into error/stopped/shutting-down state while there
   are still jobs running on it, what should the job-related methods
   do?  Raise an error?  Set the job state to UNKNOWN?  Both?

   And where should we check the state of the sub-resource's VM?  In
   `EC2Lrms` or in the sub-resource backend?
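
To make both problems concrete, here is one possible shape for a
defensive helper that the job-related methods could call first. The
helper name, the location of `ec2_instance_id` on the job, the
`get_vm()` lookup, the exception class, and the choice of both marking
the job UNKNOWN and raising are all assumptions for illustration, not
a worked-out fix:

    def _get_subresource_for_job(self, job):
        """
        Sketch only: look up the sub-resource a job was submitted to,
        failing loudly when the job or its VM is unusable.
        """
        # Jobs in NEW state were never submitted, so they carry no
        # `ec2_instance_id` attribute (see item 1 above).
        if not hasattr(job, 'ec2_instance_id'):
            raise gc3libs.exceptions.UnrecoverableError(
                "Job `%s` has no `ec2_instance_id` attribute:"
                " has it ever been submitted to the EC2 backend?" % job)
        # `get_vm()` is assumed to look a VM up by instance id.
        vm = self._vmpool.get_vm(job.ec2_instance_id)
        if vm.state in ('error', 'terminated', 'shutting-down', 'stopped'):
            # One possible policy: mark the job UNKNOWN *and* raise.
            # (Whether the state lives on `job` or `job.execution` is
            # glossed over here.)
            job.state = gc3libs.Run.State.UNKNOWN
            raise gc3libs.exceptions.UnrecoverableError(
                "VM `%s` hosting job `%s` is in state `%s`."
                % (vm.id, job, vm.state))
        return self._get_remote_resource(vm)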

Original comment by riccardo.murri@gmail.com on 13 Aug 2013 at 1:30

GoogleCodeExporter commented 9 years ago
I don't have time to look into this issue in depth, but as far as I remember
the problem was that a "pending" VM is not just a VM in "pending" state,
unfortunately. It is possible that the VM is in "running" state but still
booting, or that it has not received a public IP address yet. Therefore, the
only way to be sure that we can actually run on that VM is to create a
resource associated to it and wait until the resource is able to update
itself by connecting to the VM, checking the number of cores, or whatever
else it has to do. This approach also works if the VM is actually a cluster
managed, for instance, by elasticluster: there may be no
easy/portable/generic way to know whether the cluster is correctly
configured, so we just wait until get_resource_status() is able to actually
update the resource status.
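
In code, the idea is roughly the following sketch; the
`self.subresources` dictionary and the blanket `except` are stand-ins
for the actual backend internals:

    # Sketch: a sub-resource counts as usable only once it has managed to
    # update itself by actually connecting to its VM.
    for vm_id, subresource in self.subresources.items():
        try:
            subresource.get_resource_status()
        except Exception as err:
            # The VM may be 'running' but still booting, or lack a
            # public IP: keep the sub-resource and retry at the next poll.
            gc3libs.log.debug(
                "Sub-resource for VM `%s` not ready yet: %s", vm_id, err)
            continue
        # If we get here, jobs can actually be submitted to this VM.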

We probably need a way to remove a resource whenever its VM is in error
state or has not been terminated by GC3Pie itself, but right now I don't
have time to think about how this is best done.

.a.

Original comment by antonio....@gmail.com on 14 Aug 2013 at 12:55