deis / builder

Git server and application builder for Deis Workflow
https://deis.com
MIT License

spawning slugbuilder pod errors and fails with exitCode 51 #55

Closed bacongobbler closed 8 years ago

bacongobbler commented 8 years ago

I hit an error while pushing a build:

><> deis create go
Creating Application... done, created go
Git remote deis added
remote available at ssh://git@deis.example.com:2222/go.git
><> git push deis master
The authenticity of host '[deis.example.com]:2222 ([10.245.1.3]:2222)' can't be established.
ECDSA key fingerprint is 99:81:06:40:6a:7e:12:0b:fd:ad:b3:60:4d:93:79:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[deis.example.com]:2222,[10.245.1.3]:2222' (ECDSA) to the list of known hosts.
Counting objects: 75, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (40/40), done.
Writing objects: 100% (75/75), 18.20 KiB | 0 bytes/s, done.
Total 75 (delta 30), reused 75 (delta 30)
mc: Configuration written to [/var/minio-conf/config.json]. Please update your access credentials.
mc: Successfully created ‘/var/minio-conf/share’.
mc: Initialized share uploads ‘/var/minio-conf/share/uploads.json’ file.
mc: Initialized share downloads ‘/var/minio-conf/share/downloads.json’ file.
Added host ‘http://10.247.137.143:9000’ successfully.
Bucket created successfully ‘http://10.247.137.143:9000/git’.
‘go.tar.gz’ -> ‘http://10.247.137.143:9000/git/home/go:git-5450cbcd/tar’
Total: 2.76 KB, Transferred: 2.76 KB, Speed: 1.60 MB/s
-----> stored tarfile in http://10.247.137.143:9000/git/home/go:git-5450cbcd/tar
-----> creating builder pod in namespace deis
pod "go-git-5450cbcd" created
^C
$ kubectl get pod go-git-5450cbcd
NAME              READY     STATUS    RESTARTS   AGE
go-git-5450cbcd   0/1       Error     0          4m

The logs from the pod show nothing too useful: https://gist.github.com/bacongobbler/ce6be39446ff43d2f4c1
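
(For reference, those logs were pulled with something like the following; deis is the namespace, per the describe output below.)

$ kubectl logs go-git-5450cbcd --namespace=deis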

Where it gets interesting is here:

><> kd describe po foo-git-5450cbcd
Name:                           foo-git-5450cbcd
Namespace:                      deis
Image(s):                       quay.io/deisci/slugbuilder:v2-alpha
Node:                           10.245.1.3/10.245.1.3
Labels:                         heritage=deis,version=v2-alpha
Status:                         Failed
Reason:
Message:
IP:
Replication Controllers:        <none>
Containers:
  deis-slugbuilder:
    Image:              quay.io/deisci/slugbuilder:v2-alpha
    State:              Terminated
      Reason:           Error
      Exit Code:        51
      Started:          Thu, 17 Dec 2015 11:57:15 -0800
      Finished:         Thu, 17 Dec 2015 11:57:16 -0800
    Ready:              False
    Restart Count:      0
Conditions:
  Type          Status
  Ready         False 
Events:
  FirstSeen                             LastSeen                        Count   From                    SubobjectPath                           Reason          Message
  Thu, 17 Dec 2015 11:57:09 -0800       Thu, 17 Dec 2015 11:57:09 -0800 1       {kubelet 10.245.1.3}    implicitly required container POD       Pulled          Container image "gcr.io/google_containers/pause:0.8.0" already present on machine
  Thu, 17 Dec 2015 11:57:09 -0800       Thu, 17 Dec 2015 11:57:09 -0800 1       {scheduler }                                                    Scheduled       Successfully assigned foo-git-5450cbcd to 10.245.1.3
  Thu, 17 Dec 2015 11:57:10 -0800       Thu, 17 Dec 2015 11:57:10 -0800 1       {kubelet 10.245.1.3}    implicitly required container POD       Started         Started with docker id 018b6796d186
  Thu, 17 Dec 2015 11:57:10 -0800       Thu, 17 Dec 2015 11:57:10 -0800 1       {kubelet 10.245.1.3}    spec.containers{deis-slugbuilder}       Pulling         Pulling image "quay.io/deisci/slugbuilder:v2-alpha"
  Thu, 17 Dec 2015 11:57:10 -0800       Thu, 17 Dec 2015 11:57:10 -0800 1       {kubelet 10.245.1.3}    implicitly required container POD       Created         Created with docker id 018b6796d186
  Thu, 17 Dec 2015 11:57:14 -0800       Thu, 17 Dec 2015 11:57:14 -0800 1       {kubelet 10.245.1.3}    spec.containers{deis-slugbuilder}       Pulled          Successfully pulled image "quay.io/deisci/slugbuilder:v2-alpha"
  Thu, 17 Dec 2015 11:57:15 -0800       Thu, 17 Dec 2015 11:57:15 -0800 1       {kubelet 10.245.1.3}    spec.containers{deis-slugbuilder}       Created         Created with docker id 91bdc6111eee
  Thu, 17 Dec 2015 11:57:15 -0800       Thu, 17 Dec 2015 11:57:15 -0800 1       {kubelet 10.245.1.3}    spec.containers{deis-slugbuilder}       Started         Started with docker id 91bdc6111eee
  Thu, 17 Dec 2015 11:57:25 -0800       Thu, 17 Dec 2015 11:57:25 -0800 1       {kubelet 10.245.1.3}    implicitly required container POD       Killing         Killing with docker id 018b6796d186

These are two different builds, but both failed the same way. The exit code is 51, which on a first hit from Google is an SSL cert validation error. That might not be the problem, but hey, it's a trail.
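
For reference, exit code 51 is curl's certificate-verification failure on curl of this era (newer curl releases report it as 60). A minimal way to reproduce it, assuming a host that deliberately serves a certificate for the wrong name, such as badssl.com's test endpoint, is something like:

$ curl https://wrong.host.badssl.com/
curl: (51) SSL: no alternative certificate subject name matches target host name 'wrong.host.badssl.com'
$ echo $?
51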

My layout:

2-node vagrant cluster (1 master, 1 minion)
kubernetes v1.1.3
Deis master

arschles commented 8 years ago

Follow-up to the discussion in https://github.com/deis/workflow/pull/95 (see https://github.com/deis/workflow/pull/95#issuecomment-165535559 and below).

Also note that this may be an issue in slugbuilder and not builder. Still under investigation.

arschles commented 8 years ago

@bacongobbler

Deis master

Are you running a helm-installed deis cluster?

arschles commented 8 years ago

and more specifically, are you using minio as your object storage system, or something else?

bacongobbler commented 8 years ago

Are you running a helm-installed deis cluster?

Yep! helm install deis --namespace=deis.

and more specifically, are you using minio as your object storage system, or something else?

I'm using minio :)

><> kd get po
NAME                        READY     STATUS    RESTARTS   AGE
deis-builder-3cx18          1/1       Running   1          20m
deis-database-2lapf         1/1       Running   0          20m
deis-etcd-1-ang52           1/1       Running   0          20m
deis-etcd-1-iulk7           1/1       Running   1          20m
deis-etcd-1-qauzh           1/1       Running   0          20m
deis-etcd-discovery-1s85b   1/1       Running   0          20m
deis-minio-bzzwa            1/1       Running   0          20m
deis-registry-dqz52         1/1       Running   0          20m
deis-router-txh7z           1/1       Running   0          20m
deis-workflow-d5gzu         1/1       Running   0          20m
><> kd logs deis-minio-bzzwa
minio server /home/minio/

AccessKey: 8TZRY2JRWMPT6UMXR6I5  SecretKey: gbstrOvotMMcg2sMfGUhA5a6Et/EI5ALtIHsobYk

To configure Minio Client.

        $ wget https://dl.minio.io:9000/updates/2015/Nov/linux-amd64/mc
        $ chmod 755 mc
        $ ./mc config host add http://localhost:9000 8TZRY2JRWMPT6UMXR6I5 gbstrOvotMMcg2sMfGUhA5a6Et/EI5ALtIHsobYk
        $ ./mc mb localhost/photobucket
        $ ./mc cp ~/Photos... localhost/photobucket

Starting minio server:
Listening on http://127.0.0.1:9000
Listening on http://10.246.10.28:9000
smothiki commented 8 years ago

This is transient. I'm not able to reproduce it.

bacongobbler commented 8 years ago

Found something that seems relevant in the Python buildpack:

+ echo_title 'Python app detected'
----->' Python app detected
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
-----> Python app detected
+ /tmp/buildpacks/heroku-buildpack-python.git/bin/compile /tmp/build /tmp/cache
+ ensure_indent
+ read line
+ [[ -----> Installing runtime (python-2.7.10) == --* ]]
----->' Installing runtime '(python-2.7.10)'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
-----> Installing runtime (python-2.7.10)
+ read line
+ [[ !     Requested runtime (python-2.7.10) is not available for this stack (cedar-14). == --* ]]
      ' '!     Requested runtime (python-2.7.10) is not available for this stack (cedar-14).'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
       !     Requested runtime (python-2.7.10) is not available for this stack (cedar-14).
+ read line
+ [[ !     Aborting.  More info: https://devcenter.heroku.com/articles/python-support == --* ]]
      ' '!     Aborting.  More info: https://devcenter.heroku.com/articles/python-support'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
       !     Aborting.  More info: https://devcenter.heroku.com/articles/python-support
+ read line

Looks like the same place where the Go buildpack is failing, too. I suspect a change in slugbuilder is causing the build to fail.
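
For context on the trace: reconstructed from the output above, the two helpers in slugbuilder's build.sh do roughly the following (a sketch, not the exact source; slug_file is /tmp/slug.tgz in the trace):

# Indent buildpack output, letting '----->' step markers through unindented.
ensure_indent() {
  while read line; do
    if [[ "$line" == --* ]]; then
      echo "$line" | output_redirect
    else
      echo "       $line" | output_redirect
    fi
  done
}

# When the slug is written to stdout (slug_file == '-'), send build logs to
# stderr instead so they don't corrupt the tarball stream.
output_redirect() {
  if [[ "$slug_file" == "-" ]]; then
    cat - 1>&2
  else
    cat -
  fi
}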

arschles commented 8 years ago

@bacongobbler are there errors that say something similar to 'not available' for Go? Asking because the slugbuilder build.sh file hasn't changed much recently.

bacongobbler commented 8 years ago

There is something similar for Ruby:

       !     Command: 'set -o pipefail; curl --fail --retry 3 --retry-delay 1 --connect-timeout 3 --max-time 30 https://s3-external-1.amazonaws.com/heroku-buildpack-ruby/bundler-1.9.7.tgz -s -o - | tar zxf - ' failed unexpectedly:
+ read line
+ [[ ! == --* ]]
      ' '!'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
       !
+ read line
+ [[ !     gzip: stdin: unexpected end of file == --* ]]
      ' '!     gzip: stdin: unexpected end of file'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
       !     gzip: stdin: unexpected end of file
+ read line
+ [[ !     tar: Child returned status 1 == --* ]]
      ' '!     tar: Child returned status 1'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
       !     tar: Child returned status 1
+ read line
+ [[ !     tar: Error is not recoverable: exiting now == --* ]]
      ' '!     tar: Error is not recoverable: exiting now'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
       !     tar: Error is not recoverable: exiting now
+ read line
+ [[ ! == --* ]]
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
      ' '!'
       !
+ read line

I can't see anything similar for Go, though; it just exited abruptly right at this point. Perhaps there's some networking fubar going on here?
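
If it is networking, one quick sanity check is to curl the same endpoint from inside a running pod (pod name taken from the listing above; this assumes curl exists in that image):

$ kubectl exec deis-builder-3cx18 --namespace=deis -- curl -sS -o /dev/null -w '%{http_code}\n' https://s3-external-1.amazonaws.com/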

arschles commented 8 years ago

The Ruby error may be different - it looks like the slug may have been downloaded incorrectly (or not at all). Can you show those entire logs if you still have them?

+1 on the fubar point. Something smells off here

bacongobbler commented 8 years ago

Unfortunately I blew away the cluster. I'm in the middle of rebuilding it, but I'll get you those logs.

arschles commented 8 years ago

k, thanks

bacongobbler commented 8 years ago

full logs here: https://gist.github.com/bacongobbler/257ab22d7d7e2ff69994

arschles commented 8 years ago

Yeah, that error looks to be different from the Python one. It appears no tarball is being downloaded at all...

bacongobbler commented 8 years ago

Something's definitely up. I tried running slugbuilder from my own host without k8s and it succeeded:

><> docker run -v $PWD/example-ruby-sinatra:/app quay.io/deisci/slugbuilder:v2-alpha
unable to write 'random state'
-----> Ruby app detected
-----> Compiling Ruby/Rack
-----> Using Ruby version: ruby-2.0.0
-----> Installing dependencies using 1.9.7
       Running: bundle install --without development:test --path vendor/bundle --binstubs vendor/bundle/bin -j4 --deployment
       Fetching gem metadata from http://rubygems.org/..........
       Fetching version metadata from http://rubygems.org/..
       Rubygems 2.0.14.1 is not threadsafe, so your gems must be installed one at a time. Upgrade to Rubygems 2.1.0 or higher to enable parallel gem installation.
       Installing rack 1.6.1
       Installing rack-protection 1.5.3
       Installing tilt 2.0.1
       Installing sinatra 1.4.6
       Using bundler 1.9.7
       Bundle complete! 1 Gemfile dependency, 5 gems now installed.
       Gems in the groups development and test were not installed.
       Bundled gems are installed into ./vendor/bundle.
       Bundle completed (4.51s)
       Cleaning up the bundler cache.

-----> Discovering process types
       Procfile declares types -> web
       Default process types for Ruby -> rake, console, web
-----> Compiled slug size is 16M

I'll keep digging, just documenting what I'm finding.

arschles commented 8 years ago

That was a great test, actually. By running without TAR_URL (used at https://github.com/deis/slugbuilder/blob/master/rootfs/builder/build.sh#L18-L34) and put_url (used at https://github.com/deis/slugbuilder/blob/master/rootfs/builder/build.sh#L197-L210) in the env, you've ruled out all the minio communication...
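
For anyone reproducing the in-cluster path: the builder injects those variables when it creates the pod, so the equivalent standalone run would be something like the command below. The TAR_URL value is the one from the push output above; the put_url value is made up for illustration.

$ docker run \
    -e TAR_URL='http://10.247.137.143:9000/git/home/go:git-5450cbcd/tar' \
    -e put_url='http://10.247.137.143:9000/git/home/go:git-5450cbcd/push' \
    quay.io/deisci/slugbuilder:v2-alpha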

bacongobbler commented 8 years ago

So after digging into this, it does look like a DNS issue. When I tried to hit the same endpoint the Ruby buildpack uses to fetch bundler, this is what I got with curl -L -v -o /dev/null:

* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 23.253.174.228...
* Connected to s3-external-1.amazonaws.com (23.253.174.228) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Server hello (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using ECDHE-RSA-AES256-SHA384
* Server certificate:
*        subject: OU=Domain Control Validated; CN=www.moneymoveit.com
*        start date: 2014-10-26 04:31:00 GMT
*        expire date: 2016-08-07 03:31:35 GMT
*        subjectAltName does not match s3-external-1.amazonaws.com
* SSL: no alternative certificate subject name matches target host name 's3-external-1.amazonaws.com'
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
} [data not shown]
curl: (51) SSL: no alternative certificate subject name matches target host name 's3-external-1.amazonaws.com'

Notice that the certificate's common name is www.moneymoveit.com rather than S3's. That points to a DNS issue in my cluster, which makes this an environment problem, so I'm going to migrate my cluster onto AWS and continue testing from there.
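
A quick way to confirm what the resolver is actually doing with the name: getent hosts resolves through /etc/resolv.conf, including any search domains, unlike dig, which queries the name as-is by default.

$ getent hosts s3-external-1.amazonaws.com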

arschles commented 8 years ago

Also, I tested on a stock heroku/cedar:14 image (which, as of this writing, slugbuilder is built from) and had no problems curling:

ENG000656:slugbuilder aaronschlesinger$ docker run --rm -it heroku/cedar:14 /bin/bash
root@26d409fd507c:/# curl -v -o /dev/null s3-external-1.amazonaws.com
* Rebuilt URL to: s3-external-1.amazonaws.com/
* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 54.231.80.108...
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0* Connected to s3-external-1.amazonaws.com (54.231.80.108) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: s3-external-1.amazonaws.com
> Accept: */*
> 
< HTTP/1.1 307 Temporary Redirect
< x-amz-id-2: H3NhI6aJVZValpcTgJnUu5IClOUdfSt0mMJfWZlRy1x9NysGrCBJkds+MSCKURsvK8+4GejG9NA=
< x-amz-request-id: 20B656AE7D4F4CC0
< Date: Fri, 18 Dec 2015 00:49:36 GMT
< Location: http://aws.amazon.com/s3/
< Content-Length: 0
* Server AmazonS3 is not blacklisted
< Server: AmazonS3
< 
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
* Connection #0 to host s3-external-1.amazonaws.com left intact

This test suggests that the issue is isolated. Demoting to v2.0-beta1.

arschles commented 8 years ago

Duplicate of #57; leaving open for historical purposes.

technosophos commented 8 years ago

At what point can we close this one?

Duplicate of #57; leaving open for historical purposes.

bacongobbler commented 8 years ago

Closing as irrelevant. My ISP was automatically handing me a DNS search domain, bacongobbler.com, which resolvconf would write to /etc/resolv.conf. That made lookups for s3-external-1.amazonaws.com resolve via s3-external-1.amazonaws.com.bacongobbler.com, which in turn is served by moneymoveit.com. DNS settings are shared between the host and the virtual machines, and kubernetes picks them up on the minions and uses them inside the containers, which is how the bad search domain ended up in the build pods.

After reconfiguring resolvconf on my host to stop adding bacongobbler.com to my DNS search list and to use the following resolv.conf:

><> cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4

Everything is working as intended on my VMs. For others' reference, this is how I did that:

><> cat /etc/resolvconf/resolv.conf.d/base 
nameserver 8.8.8.8
nameserver 8.8.4.4
search bacongobbler.local
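
Then regenerate /etc/resolv.conf from that base file (on Debian/Ubuntu, resolvconf's -u flag rebuilds it):

$ sudo resolvconf -u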

This one's going in the record books under "weirdest bugs to ever encounter in the wild" :)