gcloud compute disks snapshot always times out. Needs a timeout option.

GoogleCodeExporter commented 9 years ago

When creating a snapshot with "gcloud compute disks snapshot", the command 
always times out if the disk is large (10TB) with the following error:

"ERROR: (gcloud.compute.disks.snapshot) Some requests did not succeed:
 - failed to createSnapshot the following resources within 300s:" http://...

This makes it hard to tell whether the snapshot actually started or whether 
some other error occurred. Please amend the software to add the following 
options:

* configurable timeout. 300s seems too short for large volumes with high I/O.
* option to return immediately so as to check the results asynchronously with 
subsequent commands.

The command works fine for small disks.

-----------------------
gcloud info:
Google Cloud SDK [0.9.36]

Platform: [Linux, x86_64]
Python Version: [2.7.5 (default, Jun 17 2014, 18:11:42)  [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]]
Site Packages: [Disabled]

Installation Root: [/usr/local/share/google/google-cloud-sdk]
Installed Components:
  core: [2014.11.06]
  core-nix: [2014.10.20]
  gcutil: [1.16.5]
  gsutil-nix: [4.6]
  gsutil: [4.6]
  bq: [2.0.18]
  dns: [2014.11.06]
  sql: [2014.11.06]
  compute: [2014.11.06]
  gcutil-nix: [1.16.5]
  bq-nix: [2.0.18]
System PATH: [/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/usr/local/bin:/root/bin]
Cloud SDK on PATH: [False]

Installation Properties: [/usr/local/share/google/google-cloud-sdk/properties]
User Config Directory: [/root/.config/gcloud]
User Properties: [/root/.config/gcloud/properties]
Current Workspace: [None]
Workspace Config Directory: [None]
Workspace Properties: [None]

Account: [sivann@inaccess.com]
Project: [ecstatic-gantry-579]

Current Properties:
  [core]
    project: [ecstatic-gantry-579]
    account: [sivann@inaccess.com]
    user_output_enabled: [True]
  [compute]
    zone: [europe-west1-b]

Logs Directory: [/root/.config/gcloud/logs]
Last Log File: [/root/.config/gcloud/logs/2014.11.17/07.48.50.711756.log]

Original issue reported on code.google.com by siv...@inaccess.com on 17 Nov 2014 at 8:04

GoogleCodeExporter commented 9 years ago

Thank you for the feedback. We have filed a bug in our tracking system for 
this, and will work on fixing it.

Original comment by vil...@google.com on 25 Nov 2014 at 4:14

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I have run into this too. I then used the web interface to import the same 
image. It did not report that anything went wrong, but the image still does 
not work. The image shows up in the list of images, just as it does after the 
failed command-line attempt. The user interface needs to be clearer when 
things fail.

Original comment by ma...@spotify.com on 28 Nov 2014 at 7:37

GoogleCodeExporter commented 9 years ago

This happens now even with small disks (10GB) 1-2 times out of 10.

Original comment by siv...@inaccess.com on 12 Dec 2014 at 8:33

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I am facing this issue consistently with a 200 GB disk.
It's very irritating. Is there any temporary workaround for this?

Original comment by tahir.ra...@gmail.com on 23 Dec 2014 at 9:12

GoogleCodeExporter commented 9 years ago

Hi Tahir,

The next release of gcloud will have a longer timeout, and in the future this 
will be more configurable.

The error message is unclear.  This is gcloud locally *assuming* that the 
command failed because it took a long time, but it (most likely) still 
succeeded on the server.  The short-term workaround is to wait a bit longer 
and check that your snapshot was in fact created.  You can check this with 
`gcloud compute snapshots list` or through the web console at 
https://console.developers.google.com/.
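
For anyone who wants to script the "wait and check" workaround, here is a minimal polling sketch. It is not an official tool. It assumes that `gcloud compute snapshots describe NAME --format json` prints the snapshot resource with a `status` field (READY, FAILED, CREATING, ...), as the Compute Engine API describes it. The function names, the default timeout, and the polling interval are made up for illustration.

```python
import json
import subprocess
import time


def fetch_status(name):
    """Ask gcloud for the snapshot and return its status string.

    Assumption: the JSON output carries a "status" field. If the snapshot
    does not exist (yet), gcloud exits non-zero and check_output raises
    subprocess.CalledProcessError.
    """
    out = subprocess.check_output(
        ["gcloud", "compute", "snapshots", "describe", name, "--format", "json"]
    )
    return json.loads(out)["status"]


def wait_for_snapshot(name, fetch=fetch_status, timeout_s=3600, interval_s=30,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the snapshot is READY.

    Raises RuntimeError if the snapshot reports FAILED, and TimeoutError if
    it is still in any other state once timeout_s seconds have passed.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch(name)
        if status == "READY":
            return status
        if status == "FAILED":
            raise RuntimeError("snapshot %s reported FAILED" % name)
        if clock() >= deadline:
            raise TimeoutError(
                "snapshot %s still %s after %s seconds" % (name, status, timeout_s)
            )
        sleep(interval_s)
```

Calling `wait_for_snapshot("my-snapshot")` (a hypothetical name) after the `gcloud compute disks snapshot` command has timed out locally would then either return once the snapshot is READY or raise, so a backup script can exit non-zero instead of ignoring the error.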

Sorry about the trouble!

Original comment by jeffvaug...@google.com on 23 Dec 2014 at 5:19

GoogleCodeExporter commented 9 years ago

@Google, please give this issue some priority. It may lead to data loss for a 
lot of people, since errors can go undetected; catching the status afterwards 
with polling is unreliable without predefined error conditions. We also 
raised this through our support channel a long time ago.

Original comment by siv...@inaccess.com on 24 Dec 2014 at 8:48

GoogleCodeExporter commented 9 years ago

Original comment by rdayal@google.com on 26 Dec 2014 at 6:19

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

@sivann: can you please comment on possible data loss scenarios?  I haven't 
seen evidence of this, and would like to better understand how this could occur.

Also, the next release of gcloud will, in addition to the longer timeout, have 
a clearer message explaining how to check up on resources that should be 
created (etc.) by pending operations.  We are also planning to work on 
providing better and more consistent handling of timeouts across gcloud.

Original comment by jeffvaug...@google.com on 29 Dec 2014 at 7:54

GoogleCodeExporter commented 9 years ago

Since we now ignore errors from snapshots, data loss can occur if a snapshot 
actually fails to execute and the failure goes unnoticed. If we then need to 
restore data from a snapshot that does not exist, we won't be able to. 
Inter-region snapshots are the single most important advantage of cloud 
services (for both GCE and AWS) and they should work flawlessly.
I'm reluctant to trust the snapshotting mechanism right now, since there is no 
official method to check for successful completion.
We would prefer a blocking "create snapshot" call to scripting a complicated 
polling mechanism. We'll sleep better :-)
Alternatively, Google could create a web interface to schedule snapshots and 
report back on errors.
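
Until something like that exists, a periodic audit can at least surface snapshots that failed or never appeared. The sketch below is one possible approach, not an official method. It assumes that `gcloud compute snapshots list --format json` prints a JSON array of snapshot resources with `name` and `status` fields. The helper names are hypothetical.

```python
import json
import subprocess


def list_snapshots():
    """Return all snapshots in the current project as a list of dicts.

    Assumption: `--format json` yields a JSON array of snapshot resources.
    """
    out = subprocess.check_output(
        ["gcloud", "compute", "snapshots", "list", "--format", "json"]
    )
    return json.loads(out)


def not_ready(snapshots):
    """Return (name, status) pairs for every snapshot that is not READY."""
    return [
        (s["name"], s.get("status", "UNKNOWN"))
        for s in snapshots
        if s.get("status") != "READY"
    ]


def missing(snapshots, expected_names):
    """Return the expected snapshot names that are absent from the listing."""
    present = {s["name"] for s in snapshots}
    return sorted(set(expected_names) - present)
```

A nightly job could call `list_snapshots()` once, pass the result to both helpers, and alert if either returns a non-empty list. That covers the case where the snapshot exists but is FAILED as well as the case where it was never created.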

Kind Regards,
-Spiros

Original comment by siv...@inaccess.com on 31 Dec 2014 at 9:35

GoogleCodeExporter commented 9 years ago

Any updates on this?

Original comment by siv...@inaccess.com on 24 Feb 2015 at 10:42

GoogleCodeExporter commented 9 years ago

I just ran into this while trying to snapshot a 500GB SSD Persistent Disk.
ERROR: (gcloud.compute.disks.snapshot) Some requests did not succeed:
 - Did not createSnapshot the following resources within 660s: https://www.googleapis.com/compute/v1/projects/<proj_name>/zones/us-central1-f/disks/highmem8-1-aggr0. These operations may still be underway remotely and may still succeed; use gcloud list and describe commands or https://console.developers.google.com/ to check resource state

1 hour later and the snapshot is still showing as failed:

<...>@temp-sdk:~$ gcloud compute snapshots list highmem8-1-aggr0-snapshot
NAME                      DISK_SIZE_GB SRC_DISK                             STATUS
highmem8-1-aggr0-snapshot 0            us-central1-f/disks/highmem8-1-aggr0 FAILED

I tried a similar snapshot against a 200GB volume and it completed in less than 
5 minutes.

Seems like this is still a problem.

Original comment by b...@averesystems.com on 5 Aug 2015 at 7:25

GoogleCodeExporter commented 9 years ago

#12: did you try the same snapshot again? FAILED suggests to me that the 
snapshot actually failed, and didn't just time out.

We're going to bump the timeouts substantially within (likely) the next two 
releases, and will also add an override flag. (It looks like disk sizes keep 
growing.)

In the meantime, please use the web console for snapshots, or one of the 
strategies suggested in #6.

Original comment by z...@google.com on 6 Aug 2015 at 3:21

gana2188 / google-cloud-sdk

gcloud compute disks snapshot always times out. Needs a timeout option. #92