Ryotaro-Sanpe666 / google-cloud-sdk

Automatically exported from code.google.com/p/google-cloud-sdk

Phantom app version #284

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

I deployed a new version of my app, then deleted the old version, as I've done many times before:

gcloud preview app deploy app.yaml --docker-build=remote --version 704 --force --set-default

gcloud preview app modules delete default --version 702

No errors were reported by either command.  Note that I have no idea whether this 
problem is reproducible -- getting trapped by it once was scary enough that I 
don't want to try again!

What is the expected output? What do you see instead?

I expected version 704 to become the default and spin up new instances -- this 
happened.  I also expected version 702 to be deleted and shut down its 
instances -- the version was deleted, but the instances kept running.  

What is the output of 'gcloud info'?

Google Cloud SDK [0.9.79]

Platform: [Linux, x86_64]
Python Version: [2.7.6 (default, Mar 22 2014, 22:59:56)  [GCC 4.8.2]]
Site Packages: [Enabled]

Installation Root: [/root/google-cloud-sdk]
Installed Components:
  core: [2015.09.23]
  core-nix: [2015.09.03]
  app: [2015.09.23]
  gcloud: [2015.09.21]
  gsutil-nix: [4.14]
  gsutil: [4.15]
  bq: [2.0.18]
  preview: [2015.09.21]
  bq-nix: [2.0.18]
System PATH: 
[/root/bin:/root/ve/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:
/bin:/root/usr/local/bin:/root/google-cloud-sdk/bin]
Cloud SDK on PATH: [True]

Installation Properties: [/root/google-cloud-sdk/properties]
User Config Directory: [/root/.config/gcloud]
User Properties: [/root/.config/gcloud/properties]
Current Workspace: [None]
Workspace Config Directory: [None]
Workspace Properties: [None]

Account: [piotr@reviewable.io]
Project: [reviewable-prod]

Current Properties:
  [core]
    project: [reviewable-prod]
    account: [piotr@reviewable.io]
    disable_usage_reporting: [False]

Logs Directory: [/root/.config/gcloud/logs]
Last Log File: [/root/.config/gcloud/logs/2015.09.28/19.47.36.882074.log]

Please provide any additional information below.

Attempting to delete the instances manually was ineffective as they just got 
restarted by App Engine.  The logs showed the instances being bounced up and 
down due to failing health checks.  Both the console and `gcloud preview app 
modules list default` showed only version 704.

I eventually recovered by forcing a new deploy of version 702, then deleting it 
from the console (Compute > App Engine > Versions).  This ran into its own 
hiccup when my first deploy attempt claimed that:

The resource 'projects/reviewable-prod/zones/us-central1-f/instances/gae-builder-vm-702' already exists

This instance did not show up in the console (Compute Engine > VM instances) 
before running the deploy, but did show up afterward.  Deleting the instance 
and rerunning the deploy worked this time.  Just mentioning it in case it has 
bearing on the phantom version issue.

Original issue reported on code.google.com by pi...@ideanest.com on 5 Oct 2015 at 10:06
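For anyone hitting the same symptom, one way to check whether a deleted version's VMs are still around is to grep the Compute Engine instance listing for the version number. This is only a sketch: the sample listing and instance names below are hypothetical (following the gae-* pattern seen in this report); a live session would pipe the real `gcloud compute instances list` output instead.

```shell
# Hypothetical sample of `gcloud compute instances list` output, so the
# filtering step can be shown without live credentials.
sample_output="NAME                  ZONE           STATUS
gae-default-702-1a2b  us-central1-f  RUNNING
gae-default-704-3c4d  us-central1-f  RUNNING"

# Live-session equivalent (requires an authenticated gcloud):
#   gcloud compute instances list | grep -- '-702-'
echo "$sample_output" | grep -- '-702-'
```

If the deleted version's instances still show up after the delete command returns, that is the phantom-version situation described above.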

GoogleCodeExporter commented 8 years ago
Thanks for reporting this, and good catch. Sorry about the hassle.

I'll route this to the AppEngine team.

Original comment by z...@google.com on 5 Oct 2015 at 1:08

GoogleCodeExporter commented 8 years ago
Hey,

We delete VMs asynchronously, so it's possible they're still running when the 
initial deletion returns success. To handle failures, we have a periodic 
maintenance job that runs frequently, looks for leaked VMs like this, and 
deletes them.

How long did you wait before deploying a new version and deleting it?

Original comment by dlor...@google.com on 5 Oct 2015 at 4:50
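Since deletion is asynchronous, a small polling loop can confirm when the old version's VMs are actually gone instead of assuming the delete took effect immediately. This is only a sketch: `still_running` below is a stub standing in for a real check such as `gcloud compute instances list | grep -q -- '-702-'`, and it pretends the VMs disappear on the third poll.

```shell
# Stub standing in for a live check like:
#   gcloud compute instances list | grep -q -- '-702-'
# Here it simulates the VMs disappearing after the third poll.
polls=0
still_running() {
  polls=$((polls + 1))
  [ "$polls" -lt 3 ]
}

while still_running; do
  echo "old-version VMs still up after poll $polls"
  sleep 0   # use a real interval, e.g. `sleep 60`, against live gcloud
done
echo "old-version VMs gone after $polls polls"
```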

GoogleCodeExporter commented 8 years ago
The interval between my failed attempt to delete v702 and the redeployment and 
successful delete was a bit over 1 hour.

Original comment by pi...@ideanest.com on 5 Oct 2015 at 5:03

GoogleCodeExporter commented 8 years ago
Thanks for the report. I think this is basically the expected behavior of our 
system. I see some errors deleting the VMs in our logs, and you then fixed this 
manually before our next cleanup run could come along. If you see this again, 
please let me know.

Original comment by dlor...@google.com on 6 Oct 2015 at 5:02

GoogleCodeExporter commented 8 years ago
For reference, what's the interval between cleanup attempts in your system?  I 
assume I continue to be billed for the phantom VMs, so if I happened to be 
doing a bunch of deploys in a row (as sometimes happens when figuring out a 
build breakage), I could end up on the hook for a not-entirely-trivial amount 
of money.

Original comment by pi...@ideanest.com on 6 Oct 2015 at 7:11

GoogleCodeExporter commented 8 years ago
We currently run this every 2 hours. In most cases, though, this job should do 
nothing, since deleting the version should delete the VMs. The job is only 
relied on when deleting the VMs fails for some reason. You're correct that you 
continue to get billed in the meantime, but this should be a very rare case.

Original comment by dlor...@google.com on 9 Oct 2015 at 5:46
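On the billing question, a back-of-the-envelope worst case is hourly rate × instances per version × hours until the cleanup pass × number of deploys in a burst. The rate and instance count below are assumed example numbers for illustration, not actual GCE pricing.

```shell
rate_cents_per_hour=5   # assumed example rate, not a real GCE price
instances=3             # assumed instances per phantom version
hours=2                 # cleanup interval reported above
deploys=10              # a burst of deploys while debugging a build

awk -v r="$rate_cents_per_hour" -v i="$instances" -v h="$hours" -v d="$deploys" \
  'BEGIN { printf "worst-case extra cost: $%.2f\n", r * i * h * d / 100 }'
```

With these assumed numbers the worst case is a few dollars; the point is that the exposure scales linearly with how many deploys leak VMs before a cleanup pass.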

GoogleCodeExporter commented 8 years ago
Looks like this happened to me again on a deploy ~10 minutes ago.  I deployed 
version 712 and deleted version 706, but it's still running.  Either I'm very 
unlucky or there's a reproducible issue with the initial VM deletion...  I'm 
going to let the v706 VMs keep running this time to see if the backup process 
cleans them up.

Original comment by pi...@ideanest.com on 11 Oct 2015 at 7:27

GoogleCodeExporter commented 8 years ago
Looks like the v706 instances got cleaned up ~30 minutes after the deploy.  
I'll keep an eye on future deploys to see whether the initial deletion failure 
is common for me.

Original comment by pi...@ideanest.com on 11 Oct 2015 at 7:59

GoogleCodeExporter commented 8 years ago
Happened again:  deployed v714 at 10:29pm PDT, and the v712 instances didn't 
get shut down until 10:49pm.  A 20-minute delay isn't terrible, I guess, but it 
still looks like the initial deletion is failing consistently.  It also means 
I have to run that much longer with version skew...

Original comment by pi...@ideanest.com on 12 Oct 2015 at 5:54