coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

Restore from GCS fails. #2117

Closed nicholasklem closed 5 years ago

nicholasklem commented 5 years ago

Restore from GCS fails. It looks like the curl download of the backup file produces a file with "placeholder" as content:

# curl -vv http://etcd-restore-operator:19999/v1/backup/etcd-prism && echo
*   Trying 10.0.47.87...
* Connected to etcd-restore-operator (10.0.47.87) port 19999 (#0)
> GET /v1/backup/etcd-prism HTTP/1.1
> Host: etcd-restore-operator:19999
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Sat, 14 Sep 2019 08:02:59 GMT
< Content-Length: 11
< Content-Type: text/plain; charset=utf-8
< 
* Connection #0 to host etcd-restore-operator left intact
placeholder

Backup works, files are in place in the format: gs://...backup-etcd/kubernetes03-eu-w1-01/_v1_2019-09-14-02:51:15

and they are actual etcd snapshot files.

Restore fails when the init container tries to use the snapshot file:

...
    State:      Terminated
      Reason:   Error
      Message:  2019-09-14 07:51:14.176035 I | pkg/netutil: resolving etcd-prism-tcvd5hn7c4.etcd-prism.etcd.svc:2380 to 10.40.0.15:2380
2019-09-14 07:51:14.179186 I | pkg/netutil: resolving etcd-prism-tcvd5hn7c4.etcd-prism.etcd.svc:2380 to 10.40.0.15:2380
Error: seek /var/etcd/latest.backup: invalid argument

which makes sense seeing that /var/etcd/latest.backup is not a valid file.
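
A minimal sketch of a pre-restore sanity check, in Go. It assumes that the restore step seeks backwards from the end of the file to read a trailing SHA-256 hash, which would explain why an 11-byte file yields "seek ...: invalid argument". That assumption is not confirmed in this thread, and `checkSnapshotSize` is a hypothetical helper, not part of etcd or etcd-operator.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
)

// checkSnapshotSize is a hypothetical helper. It rejects files too small to
// hold any data plus a trailing SHA-256 hash (32 bytes). The hash layout is
// an assumption about the snapshot format, not something shown in the logs.
func checkSnapshotSize(size int64) error {
	if size <= sha256.Size {
		return fmt.Errorf("file is %d bytes, too small to be an etcd snapshot", size)
	}
	return nil
}

func main() {
	// Path taken from the error message above.
	const path = "/var/etcd/latest.backup"

	info, err := os.Stat(path)
	if err != nil {
		fmt.Println("cannot stat backup:", err)
		return
	}
	if err := checkSnapshotSize(info.Size()); err != nil {
		fmt.Println("refusing to restore:", err)
		return
	}
	fmt.Println("backup size looks plausible:", info.Size(), "bytes")
}
```

A size check like this would not prove the file is a valid snapshot, but it would catch an 11-byte "placeholder" body before the init container runs.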

Any pointers to where "placeholder" comes from are appreciated. Would it be better to use S3 for backup and restore?

nicholasklem commented 5 years ago

Meh. Turns out it was user error.

Running etcd-restore-operator with

...
        env:
        - name: GODEBUG
          value: http2debug=1

gives:

2019/09/15 07:35:55 http2: Transport encoding header ":authority" = "storage.googleapis.com"
2019/09/15 07:35:55 http2: Transport encoding header ":method" = "GET"
2019/09/15 07:35:55 http2: Transport encoding header ":path" = "/backup-etcd/kubernetes03-eu-w1-01/"
2019/09/15 07:35:55 http2: Transport encoding header ":scheme" = "https"
2019/09/15 07:35:55 http2: Transport encoding header "authorization" = "Bearer ..."
2019/09/15 07:35:55 http2: Transport encoding header "x-cloud-trace-context" = ..."
2019/09/15 07:35:55 http2: Transport encoding header "user-agent" = "gcloud-golang-storage/20151204"
2019/09/15 07:35:55 http2: Transport encoding header "accept-encoding" = "gzip"
2019/09/15 07:35:55 http2: Transport received HEADERS flags=END_HEADERS stream=1 len=461
2019/09/15 07:35:55 http2: Transport received DATA stream=1 len=11 data="placeholder"
2019/09/15 07:35:55 http2: Transport received DATA flags=END_STREAM stream=1 len=0 data=""

So "placeholder" is Google Cloud's answer to a request for a path ending in "/". I wrongly assumed that asking etcd-restore-operator for a path ending in "/" would just return the latest object.

So not a bug.

etcd-restore-operator could be more helpful in its logging, and it could verify that backupUrl does not end in "/". But in the end, a user error.
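
The check suggested above could look roughly like the following Go sketch. `validateBackupPath` is a hypothetical helper, not an existing etcd-operator function, and the sample paths are adapted from the logs in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validateBackupPath is a hypothetical helper. It rejects paths that cannot
// name a single backup object: empty paths and paths ending in "/".
func validateBackupPath(path string) error {
	if strings.TrimSpace(path) == "" {
		return errors.New("backup path is empty")
	}
	if strings.HasSuffix(path, "/") {
		return fmt.Errorf("backup path %q ends in \"/\": it must name a backup object, not a prefix", path)
	}
	return nil
}

func main() {
	paths := []string{
		"backup-etcd/kubernetes03-eu-w1-01/",                        // prefix only, rejected
		"backup-etcd/kubernetes03-eu-w1-01/_v1_2019-09-14-02:51:15", // full object name, accepted
	}
	for _, p := range paths {
		if err := validateBackupPath(p); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", p)
	}
}
```

Failing early with an explicit message would have pointed at the trailing "/" directly, instead of surfacing later as a seek error in the init container.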