ConPaaS-team / conpaas

ConPaaS: integrated runtime environment for elastic cloud applications
http://www.conpaas.eu
BSD 3-Clause "New" or "Revised" License

Problem stopping the Galera service on Amazon EC2 #82

Closed gpierre42 closed 9 years ago

gpierre42 commented 9 years ago

I tried using the Galera service on the EC2-based ConPaaS deployment provided by @tcrivat. I created the service, then started it, then stopped it. For some reason the service didn't stop, and remained in "adapting" state forever.

tcrivat commented 9 years ago

This is the same as issue #68. The service will stop, but only after a long time.

gpierre42 commented 9 years ago

Really? I had time to go and have lunch, and when I came back the service still hadn't stopped. In case you wonder why the instances are now gone, I killed them by hand in the EC2 dashboard.

tcrivat commented 9 years ago

That's what I remember was happening. I will check again myself.

tcrivat commented 9 years ago

Indeed, this is a different issue than #68. The service doesn't stop at all. But it happens only on Amazon EC2. I tested on OpenNebula, and the service stops fine there (after a delay of ~2 min because of #68). I will update the title to reflect that this is an Amazon EC2-specific issue.

alescernivec commented 9 years ago

Is there an AMI image we can use?

tcrivat commented 9 years ago

The current ConPaaS deployment on Amazon EC2 is accessible here ( http://conpaas-online.ddns.net/ ) and it uses ami-5594d765. You can directly use the installation located at that link.

alescernivec commented 9 years ago

I remembered it was posted somewhere before :) However, I cannot find the AMI under community AMIs on EC2. I will try the direct link, thanks.

tcrivat commented 9 years ago

This happens because the scratch volume is not detached, although the detach_volume function from the EC2 driver library returns True. Because the volume is not detached, deleting the volume fails and an exception is thrown in this line:

https://github.com/ConPaaS-team/conpaas/blob/dev/conpaas-services/src/conpaas/core/manager.py#L254

The exception is not caught anywhere in the Galera manager code, so the whole service stopping procedure halts.

This exception is not caught in XtreemFS either, but in the XtreemFS case, the volume is detached successfully, so this bug does not happen.
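The failure mode described above can be sketched as follows. This is only an illustration, not the ConPaaS code: `detach`, `is_attached` and `destroy` are hypothetical callables standing in for the cloud driver calls, and the sketch assumes that a `True` result from the detach call only means the request was accepted, so the real attachment state is polled before deleting.

```python
import time


class VolumeStillAttached(Exception):
    """Raised when a volume does not reach the detached state in time."""


def detach_and_destroy(detach, is_attached, destroy, timeout=60.0, poll=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Detach a volume, confirm it is really detached, then destroy it.

    All three callables are hypothetical stand-ins for driver calls.
    """
    if not detach():
        raise VolumeStillAttached("detach request was rejected")

    deadline = clock() + timeout
    # A True result from detach() is not trusted: poll the real state.
    while is_attached():
        if clock() >= deadline:
            raise VolumeStillAttached(
                "volume still attached after %.0f seconds" % timeout)
        sleep(poll)

    # Only reached once the volume is no longer reported as attached.
    destroy()
```

With a check like this, a volume stuck in the "busy" state would surface as an explicit timeout error instead of a failed delete further down the stop procedure.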

tcrivat commented 9 years ago

A quick (and dirty) fix would be to suppress that exception. The service will shut down promptly (including the agent VM), but the volume will not be deleted.
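A minimal sketch of what such a suppression could look like, assuming hypothetical `destroy_volume` and `kill_node` callables rather than the actual manager methods:

```python
import logging

logger = logging.getLogger("galera-manager-sketch")


def release_agent(node, destroy_volume, kill_node):
    """Best-effort cleanup: a failing volume deletion no longer aborts the stop.

    `destroy_volume` and `kill_node` are hypothetical stand-ins.
    """
    try:
        destroy_volume(node)
    except Exception:
        # Quick and dirty: log the error and carry on with the shutdown.
        # Side effect: the volume is left behind and must be removed by hand.
        logger.exception("could not delete volume of %s; leaving it in place",
                         node)
    # Always reached, so the agent VM is shut down either way.
    kill_node(node)
```

The trade-off is the one stated above: the agent VM goes away, but the undeleted volume stays in the account.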

tcrivat commented 9 years ago

According to the Amazon EC2 documentation, a volume should be unmounted from the VM before detaching:

http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/ApiReference-cmd-DetachVolume.html

As the documentation says: "Failure to do so will result in the volume being stuck in 'busy' state while detaching." This seems to be exactly what is happening in our case.

The commit that added support for volumes in the Galera service (db53252c5472f0d905f12122e83cb31f08c86642) does not seem to contain any code that does the unmount, so I presume that Galera does not unmount the volume at all.
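The order of operations the documentation asks for can be sketched like this. It is a simplified illustration under stated assumptions: the mount point and the `detach` callable are hypothetical, and in a real deployment the unmount would run on the agent VM while the detach request is issued from wherever the cloud driver lives.

```python
import subprocess


def unmount_then_detach(mount_point, detach, run=subprocess.check_call):
    """Unmount the filesystem first, then ask the cloud to detach the volume.

    `detach` is a hypothetical callable wrapping the driver's detach call.
    """
    # `umount` exits non-zero if the filesystem is still in use; check_call
    # then raises CalledProcessError and detach() is never reached.
    run(["umount", mount_point])
    return detach()
```

Any process still holding files open on the volume (for example the database server) would have to be stopped before the unmount can succeed.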

Another possibility is to detach the volume forcefully, even if it has not been unmounted first. The Amazon EC2 CLI allows this, but the libcloud library that we use does not seem to support the option:

http://libcloud.apache.org/apidocs/0.11.3/libcloud.compute.drivers.ec2.EC2NodeDriver.html#detach_volume
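For comparison only, here is a sketch of a forced detach made through a different library. It assumes a boto3 EC2 client, which is not what ConPaaS uses, and a hypothetical volume ID supplied by the caller; `Force` corresponds to the force option of the EC2 DetachVolume action.

```python
def force_detach(ec2_client, volume_id):
    """Request a forced detach of `volume_id`.

    `ec2_client` is assumed to be a boto3 EC2 client, for example the
    object returned by boto3.client("ec2"). This is not the libcloud driver.
    """
    # A forced detach skips the clean unmount, so unflushed data on the
    # volume may be lost. Treat it as a last resort.
    return ec2_client.detach_volume(VolumeId=volume_id, Force=True)
```

Because the filesystem is never cleanly unmounted, this route risks data loss or corruption on the volume.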

In conclusion, this issue can only be solved by modifying the Galera service to unmount the volume before detaching it.

FrancoCaffarraAndEsterDiBello commented 9 years ago

How can we delete the instances of a frozen service? Can someone do this for us?

tcrivat commented 9 years ago

Sure, I can do that. Any time you need me to do this, please send me an e-mail at teodor.crivat@gmail.com. Thanks.

FrancoCaffarraAndEsterDiBello commented 9 years ago

We have updated the code to do the unmount in a7579101d693ad9cf56551b42ab87a38f8b508c8. Could you please update the installation with the current code to test it?

tcrivat commented 9 years ago

It fails with "Sorry: IndentationError: unexpected indent (role.py, line 391)". I fixed the indentation and will try again.

FrancoCaffarraAndEsterDiBello commented 9 years ago

ok thanks :)

On 11 Oct 2014, at 19:37, Teodor Crivat notifications@github.com wrote:

It fails with "Sorry: IndentationError: unexpected indent (role.py, line 391)". I fixed the indentation and will try again.

tcrivat commented 9 years ago

It works! The Galera service now stops cleanly and promptly on both Amazon EC2 and OpenNebula.

Thanks @FrancoCaffarraAndEsterDiBello for your prompt action.

This issue can be closed now.