cloudify-cosmo / cloudify-openstack-plugin

Cloudify OpenStack Plugin

Running nova_plugin.server.create twice leaks servers #51

Open mutability opened 9 years ago

mutability commented 9 years ago

If the nova_plugin.server.create operation is run twice on the same node instance for any reason (e.g. you rerun a partially-failed install workflow), then a second server is created and the existing one is leaked: it is no longer known to Cloudify and will never be cleaned up.

Running the operation twice should either be idempotent or fail (idempotency would be nicer!).
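
For concreteness, here is a minimal sketch of the failure mode, with hypothetical, simplified names (this is not the plugin's actual code), assuming the server id is tracked in a runtime property:

```python
# Simplified sketch of the leak; names are hypothetical. Each call
# unconditionally creates a new server and overwrites the runtime
# property that tracks it.
def create(nova_client, ctx, params):
    server = nova_client.servers.create(
        name=params['name'], image=params['image'], flavor=params['flavor'])
    # On a second run, this overwrite discards the id of the first server,
    # which keeps running but is no longer known to Cloudify.
    ctx.instance.runtime_properties['external_id'] = server.id
```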

iliapolo commented 9 years ago

Hi, just to make sure I understand: you would like Cloudify to identify that a server was already created for a particular node instance, and reuse that server in that case?

If you are referring to cleanup, then yes, we currently have an issue with re-running failed workflows. The way to do it for now is to run the "opposite" workflow before running the original one again: if you run uninstall before running install again, it will clean up that server. An issue to discuss here is probably the rollback abilities of a workflow, and the ability to restart it from a particular point, ideally re-running the workflow from the last failure point.
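
For illustration, the workaround expressed with the Cloudify Python REST client (a sketch: the manager address and deployment id are placeholders, and in practice you would wait for the uninstall execution to terminate before starting install):

```python
# Sketch of the workaround via cloudify-rest-client; 'MANAGER_IP' and
# 'my_deployment' are placeholders.
from cloudify_rest_client import CloudifyClient

client = CloudifyClient('MANAGER_IP')

# Clean up whatever the failed install created...
client.executions.start(deployment_id='my_deployment',
                        workflow_id='uninstall')

# ...then, once the uninstall execution has terminated (polling omitted),
# rerun install from a clean slate.
client.executions.start(deployment_id='my_deployment',
                        workflow_id='install')
```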

mutability commented 9 years ago

If create is called when there are already runtime properties on the node instance pointing to a VM, and that VM still exists, then create should definitely not create a new VM, overwriting the existing properties and losing knowledge of the old (still running) VM. It would be nice to have it just return quietly in this case (much like calling start on an already-started server is harmless), but it would be OK to fail too. The silently-leaks-resources part is the real problem at the moment.
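
A minimal sketch of that guard, assuming the server id lives in a runtime property named 'external_id' and that a configured novaclient instance is injected as nova_client (both are assumptions, not the plugin's actual implementation):

```python
from cloudify import ctx
from cloudify.decorators import operation
from novaclient import exceptions as nova_exceptions


@operation
def create(nova_client, **kwargs):
    # If a previous run already recorded a server, verify it still exists
    # and reuse it rather than creating (and leaking) another one.
    server_id = ctx.instance.runtime_properties.get('external_id')
    if server_id:
        try:
            nova_client.servers.get(server_id)
            ctx.logger.info('Server {0} already exists; skipping creation.'
                            .format(server_id))
            return
        except nova_exceptions.NotFound:
            # The recorded server is gone; fall through and recreate it.
            pass
    server = nova_client.servers.create(
        name=ctx.node.properties['server']['name'],
        image=ctx.node.properties['server']['image'],
        flavor=ctx.node.properties['server']['flavor'])
    ctx.instance.runtime_properties['external_id'] = server.id
```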

What I'm trying to achieve is an "install/repair the system" workflow that always tries to progress from the current state towards the fully-installed state. Ideally that would be the built-in "install" workflow. But to support this we at least need lifecycle operations that don't leak resources unpredictably.
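
As a conceptual sketch (hypothetical helpers, not Cloudify's workflow API), such a workflow makes every step a state-checked no-op once its work is done, so rerunning it after a partial failure only performs the remaining steps:

```python
# Conceptual sketch of a convergent install/repair workflow; the state
# probes (exists, is_started) and lifecycle calls are hypothetical.
def install_or_repair(instances):
    for instance in instances:
        if not instance.exists():
            instance.create()
        if not instance.is_started():
            instance.configure()
            instance.start()
```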

Cleanup and rollback are complex problems. I don't have a particular opinion there, other than to point out that it's really hard to make aggregate operations like workflows atomic in the face of failure, so you're going to have to deal with partially-created states at some point. (And of course there are things like VM failure that can result in similar-looking states.)

mutability commented 9 years ago

BTW, if you have such an "install/repair" workflow, then healing from failures is a bit simpler: tear down (reset) all failed node instances and anything contained within them, then run the install/repair workflow.
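
In the same hypothetical terms as the sketch above:

```python
# Conceptual sketch of healing: reset failed instances and their
# contained subtree, then let the convergent workflow rebuild them.
def heal(instances):
    for instance in instances:
        if instance.is_failed():
            for contained in instance.contained_subtree():
                contained.teardown()
            instance.teardown()
    install_or_repair(instances)
```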

iliapolo commented 9 years ago

I agree with what you said about cleanup and rollback being complex problems. We have already started discussions in this area.

Regarding what you are trying to achieve, I would differentiate repair from install. Repair implies that something was broken; re-running the install workflow does not mean something is broken, but that something was not installed properly, and here idempotency is very useful, like you mentioned. Basically post- vs. pre-deployment workflows.

Regarding this repair workflow, we have actually already started developing it. It's currently in the testing phase and will be available in the 3.2 release. What it does is exactly what you described, plus executing all the necessary relationship operations, of course.

This workflow can be integrated with our monitoring system to achieve automatic healing.

mutability commented 9 years ago

The repair workflow is probably always going to be a superset of the install workflow (consider the "everything fails" case).

iliapolo commented 9 years ago

Right, and it's not just a superset: in the current implementation it will also contain the uninstall/teardown workflow. In any case, I see your point about the OpenStack plugin create operation; I will write back here soon about it.