Closed: bacongobbler closed this issue 8 years ago
Follow-up to the discussion in https://github.com/deis/workflow/pull/95 (see https://github.com/deis/workflow/pull/95#issuecomment-165535559 and below).
Also note that this may be an issue in slugbuilder and not builder; still under investigation.
@bacongobbler
Deis master
Are you running a helm installed deis cluster?
and more specifically, are you using minio as your object storage system, or something else?
Are you running a helm installed deis cluster?
Yep! helm install deis --namespace=deis.
and more specifically, are you using minio as your object storage system, or something else?
I'm using minio :)
><> kd get po
NAME                        READY     STATUS    RESTARTS   AGE
deis-builder-3cx18          1/1       Running   1          20m
deis-database-2lapf         1/1       Running   0          20m
deis-etcd-1-ang52           1/1       Running   0          20m
deis-etcd-1-iulk7           1/1       Running   1          20m
deis-etcd-1-qauzh           1/1       Running   0          20m
deis-etcd-discovery-1s85b   1/1       Running   0          20m
deis-minio-bzzwa            1/1       Running   0          20m
deis-registry-dqz52         1/1       Running   0          20m
deis-router-txh7z           1/1       Running   0          20m
deis-workflow-d5gzu         1/1       Running   0          20m
><> kd logs deis-minio-bzzwa
minio server /home/minio/
AccessKey: 8TZRY2JRWMPT6UMXR6I5 SecretKey: gbstrOvotMMcg2sMfGUhA5a6Et/EI5ALtIHsobYk
To configure Minio Client.
$ wget https://dl.minio.io:9000/updates/2015/Nov/linux-amd64/mc
$ chmod 755 mc
$ ./mc config host add http://localhost:9000 8TZRY2JRWMPT6UMXR6I5 gbstrOvotMMcg2sMfGUhA5a6Et/EI5ALtIHsobYk
$ ./mc mb localhost/photobucket
$ ./mc cp ~/Photos... localhost/photobucket
Starting minio server:
Listening on http://127.0.0.1:9000
Listening on http://10.246.10.28:9000
This is transient. I'm not able to reproduce it.
Found something that seems relevant in the Python buildpack:
+ echo_title 'Python app detected'
----->' Python app detected
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
-----> Python app detected
+ /tmp/buildpacks/heroku-buildpack-python.git/bin/compile /tmp/build /tmp/cache
+ ensure_indent
+ read line
+ [[ -----> Installing runtime (python-2.7.10) == --* ]]
----->' Installing runtime '(python-2.7.10)'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
-----> Installing runtime (python-2.7.10)
+ read line
+ [[ ! Requested runtime (python-2.7.10) is not available for this stack (cedar-14). == --* ]]
' '! Requested runtime (python-2.7.10) is not available for this stack (cedar-14).'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
! Requested runtime (python-2.7.10) is not available for this stack (cedar-14).
+ read line
+ [[ ! Aborting. More info: https://devcenter.heroku.com/articles/python-support == --* ]]
' '! Aborting. More info: https://devcenter.heroku.com/articles/python-support'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
! Aborting. More info: https://devcenter.heroku.com/articles/python-support
+ read line
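For anyone following the trace: here's roughly what the output_redirect and ensure_indent helpers in build.sh are doing (a paraphrased sketch inferred from the trace above, not the exact source; the slug_file variable name and the indent width are assumptions):

slug_file=/tmp/slug.tgz   # stand-in for the script's slug path (assumption)

output_redirect() {
  # When the slug is written to stdout ("-"), build output has to go to
  # stderr so it doesn't corrupt the tarball; otherwise stdout is fine.
  if [[ "$slug_file" == "-" ]]; then
    cat - 1>&2
  else
    cat -
  fi
}

ensure_indent() {
  # Buildpack convention: lines starting with "--" are section headers
  # ("-----> ..."); everything else is indented beneath them.
  while read line; do
    if [[ "$line" == --* ]]; then
      echo "$line" | output_redirect
    else
      echo "       $line" | output_redirect
    fi
  done
}

That matches the trace: each read line triggers the [[ ... == --* ]] header test, then the [[ /tmp/slug.tgz == \- ]] check, then falls through to a plain cat -.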
Looks like the same place where the Go buildpack is failing, too. I suspect a change in slugbuilder is causing the build to fail.
@bacongobbler are there errors that say something similar to 'not available' for Go? Asking because the slugbuilder build.sh file hasn't changed much recently.
There is something similar for Ruby:
! Command: 'set -o pipefail; curl --fail --retry 3 --retry-delay 1 --connect-timeout 3 --max-time 30 https://s3-external-1.amazonaws.com/heroku-buildpack-ruby/bundler-1.9.7.tgz -s -o - | tar zxf - ' failed unexpectedly:
+ read line
+ [[ ! == --* ]]
' '!'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
!
+ read line
+ [[ ! gzip: stdin: unexpected end of file == --* ]]
' '! gzip: stdin: unexpected end of file'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
! gzip: stdin: unexpected end of file
+ read line
+ [[ ! tar: Child returned status 1 == --* ]]
' '! tar: Child returned status 1'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
! tar: Child returned status 1
+ read line
+ [[ ! tar: Error is not recoverable: exiting now == --* ]]
' '! tar: Error is not recoverable: exiting now'
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
! tar: Error is not recoverable: exiting now
+ read line
+ [[ ! == --* ]]
+ output_redirect
+ [[ /tmp/slug.tgz == \- ]]
+ cat -
' '!'
!
+ read line
Can't see anything similar for Go, though. It just exited abruptly right after this. Perhaps there's some networking fubar going on here?
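One way to chase the networking angle would be to run the exact failing fetch (copied from the error above) from inside the builder pod — a sketch, using the pod name from the listing earlier and assuming curl and tar are on the pod's PATH:

><> kubectl exec deis-builder-3cx18 --namespace=deis -- bash -c \
      'set -o pipefail; curl --fail --retry 3 --retry-delay 1 --connect-timeout 3 --max-time 30 \
       https://s3-external-1.amazonaws.com/heroku-buildpack-ruby/bundler-1.9.7.tgz -s -o - | tar zxf - -C /tmp'

If that reproduces the gzip: stdin: unexpected end of file error, the problem is the pod's network/DNS rather than slugbuilder itself.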
The ruby error may be different - it looks like the slug may have been downloaded incorrectly (or not at all). Can you show those entire logs if you still have them?
+1 on the fubar point. Something smells off here
Unfortunately I blew away the cluster. I'm in the middle of rebuilding it, but I'll get you those logs.
k, thanks
full logs here: https://gist.github.com/bacongobbler/257ab22d7d7e2ff69994
Yea, that error looks to be different from the python one. It appears to be downloading no tarball at all...
Something's definitely up. I tried running slugbuilder from my own host without k8s and it succeeded:
><> docker run -v $PWD/example-ruby-sinatra:/app quay.io/deisci/slugbuilder:v2-alpha
unable to write 'random state'
-----> Ruby app detected
-----> Compiling Ruby/Rack
-----> Using Ruby version: ruby-2.0.0
-----> Installing dependencies using 1.9.7
Running: bundle install --without development:test --path vendor/bundle --binstubs vendor/bundle/bin -j4 --deployment
Fetching gem metadata from http://rubygems.org/..........
Fetching version metadata from http://rubygems.org/..
Rubygems 2.0.14.1 is not threadsafe, so your gems must be installed one at a time. Upgrade to Rubygems 2.1.0 or higher to enable parallel gem installation.
Installing rack 1.6.1
Installing rack-protection 1.5.3
Installing tilt 2.0.1
Installing sinatra 1.4.6
Using bundler 1.9.7
Bundle complete! 1 Gemfile dependency, 5 gems now installed.
Gems in the groups development and test were not installed.
Bundled gems are installed into ./vendor/bundle.
Bundle completed (4.51s)
Cleaning up the bundler cache.
-----> Discovering process types
Procfile declares types -> web
Default process types for Ruby -> rake, console, web
-----> Compiled slug size is 16M
I'll keep digging, just documenting what I'm finding.
That was a great test, actually. By running without TAR_URL (used at https://github.com/deis/slugbuilder/blob/master/rootfs/builder/build.sh#L18-L34) and put_url (used at https://github.com/deis/slugbuilder/blob/master/rootfs/builder/build.sh#L197-L210) in the env, you've ruled out all the minio communication...
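And conversely, you could point a local slugbuilder run at the minio endpoints to exercise just that path. A sketch — the env var names come from the build.sh lines linked above, the IP/port is the minio listener from the logs earlier, but the bucket and object names below are made-up placeholders:

><> docker run -v $PWD/example-ruby-sinatra:/app \
      -e TAR_URL=http://10.246.10.28:9000/bucket/example-ruby-sinatra.tar.gz \
      -e put_url=http://10.246.10.28:9000/bucket/slug.tgz \
      quay.io/deisci/slugbuilder:v2-alpha

If the local run only fails once those are set, the minio communication is the culprit.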
So after digging into this, it does look like a DNS issue. After trying to hit the same endpoint as the ruby buildpack (to fetch bundler), this is what I got with curl -L -v -o /dev/null:
* Hostname was NOT found in DNS cache
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 23.253.174.228...
* Connected to s3-external-1.amazonaws.com (23.253.174.228) port 443 (#0)
* successfully set certificate verify locations:
* CAfile: none
CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Server hello (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using ECDHE-RSA-AES256-SHA384
* Server certificate:
* subject: OU=Domain Control Validated; CN=www.moneymoveit.com
* start date: 2014-10-26 04:31:00 GMT
* expire date: 2016-08-07 03:31:35 GMT
* subjectAltName does not match s3-external-1.amazonaws.com
* SSL: no alternative certificate subject name matches target host name 's3-external-1.amazonaws.com'
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
} [data not shown]
curl: (51) SSL: no alternative certificate subject name matches target host name 's3-external-1.amazonaws.com'
Notice that the certificate's common name is moneymoveit.com rather than S3's. It's likely a DNS issue in my cluster, and since that makes it an environment issue, I'm gonna migrate my cluster onto AWS and continue testing from there.
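For reference, a quick way to spot a DNS issue like this from inside the cluster (a sketch — any pod with shell access works; it assumes getent is available in the image):

><> kubectl exec deis-builder-3cx18 --namespace=deis -- cat /etc/resolv.conf
><> kubectl exec deis-builder-3cx18 --namespace=deis -- getent hosts s3-external-1.amazonaws.com

An unexpected search domain in resolv.conf, or the hostname resolving to something that clearly isn't S3 (like the 23.253.174.228 above), points at the host's DNS leaking into the containers.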
also, I tested on a stock heroku/cedar:14 (which, as of this writing, slugbuilder is built from) and had no problems curling:
ENG000656:slugbuilder aaronschlesinger$ docker run --rm -it heroku/cedar:14 /bin/bash
root@26d409fd507c:/# curl -v -o /dev/null s3-external-1.amazonaws.com
* Rebuilt URL to: s3-external-1.amazonaws.com/
* Hostname was NOT found in DNS cache
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 54.231.80.108...
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0* Connected to s3-external-1.amazonaws.com (54.231.80.108) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: s3-external-1.amazonaws.com
> Accept: */*
>
< HTTP/1.1 307 Temporary Redirect
< x-amz-id-2: H3NhI6aJVZValpcTgJnUu5IClOUdfSt0mMJfWZlRy1x9NysGrCBJkds+MSCKURsvK8+4GejG9NA=
< x-amz-request-id: 20B656AE7D4F4CC0
< Date: Fri, 18 Dec 2015 00:49:36 GMT
< Location: http://aws.amazon.com/s3/
< Content-Length: 0
* Server AmazonS3 is not blacklisted
< Server: AmazonS3
<
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
* Connection #0 to host s3-external-1.amazonaws.com left intact
this test suggests that the issue is isolated. demoting to v2.0-beta1
duplicate of #57, leaving open for historical purposes
At what point can we close this one?
duplicate of #57, leaving open for historical purposes
Closing as irrelevant. This was because my ISP was automatically providing me a DNS search domain bacongobbler.com, which resolvconf would write to /etc/resolv.conf, making all requests go through https://s3-external-1.amazonaws.com.bacongobbler.com, which at that point resolves to moneymoveit.com. DNS settings are shared between the host and a virtual machine, which kubernetes will pick up on the minions and use inside the containers. This is what caused the issue to pop up.
After reconfiguring resolvconf on my host to stop adding bacongobbler.com to my DNS search path and to use the following resolv.conf:
><> cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
Everything is working as intended on my VMs. For others' reference, this is how I did that:
><> cat /etc/resolvconf/resolv.conf.d/base
nameserver 8.8.8.8
nameserver 8.8.4.4
search bacongobbler.local
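(If you're reproducing this fix: editing the base file alone isn't enough — resolvconf has to regenerate /etc/resolv.conf, which on Ubuntu is typically done with:)

><> sudo resolvconf -u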
This one's going in the record books under "weirdest bugs to ever encounter in the wild" :)
I hit an error while pushing a build:
The logs from the pod shows nothing too useful: https://gist.github.com/bacongobbler/ce6be39446ff43d2f4c1
Where it gets interesting is here:
These are two different builds, but both failed the same way. The exit code is 51, which on a first hit from Google is an SSL cert validation error. That might not be the problem, but hey, it's a trail.
My layout: 2-node vagrant cluster (1 master, 1 minion), kubernetes v1.1.3, Deis master.
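(For the curious: in curl of that era, exit code 51 meant a failed peer-certificate check — consistent with the hostname-mismatch error dug up above; newer curl versions, 7.62.0+, report this class of failure as 60 instead. An illustrative way to see it, using a deliberate-mismatch test host:)

><> curl https://wrong.host.badssl.com/ ; echo "exit code: $?"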