deis / builder

Git server and application builder for Deis Workflow
https://deis.com
MIT License
40 stars 41 forks source link

Builder fails constantly with panic in crypto/ssh #462

Closed Bregor closed 7 years ago

Bregor commented 7 years ago

Environment:

Fail ratio:

$ kubectl get po -n deis
NAME                                     READY     STATUS    RESTARTS   AGE
deis-builder-3815239197-sgwpb            1/1       Running   19         11h

Log:

$ kubectl logs -n deis -p deis-builder-3815239197-sgwpb
2017/01/11 18:05:49 Starting health check server on port 8092
2017/01/11 18:05:49 Starting deleted app cleaner
2017/01/11 18:05:49 Starting SSH server on 0.0.0.0:2223
Listening on 0.0.0.0:2223
Accepting new connections.
Accepted connection.
---> ---> ---> panic: runtime error: slice bounds out of range

goroutine 884 [running]:
panic(0x1170b00, 0xc4200180c0)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc420182148, 0x1bbf2e0, 0xc42007c5a0, 0xc4204da740, 0x14, 0x20, 0x0, 0x780, 0xc4204da740)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:379 +0x3bf
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1bc8860, 0xc420182148, 0x1bbf2e0, 0xc42007c5a0, 0xc4204da740, 0x14, 0x20, 0x11030a0, 0x10caae3, 0x11a4180, ...)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:157 +0x6d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc420118b50, 0x7f6c89ca8440, 0xc42041ab40, 0x1bbf2e0, 0xc42007c5a0, 0xc42058a000, 0x1bc8860, 0xc420182148, 0x40def4, 0x12f8db4, ...)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:172 +0x4fb
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc4205be000, 0x1bc8820, 0xc420118b50, 0xc42031a000, 0xc42058a000, 0x1, 0x0, 0xe3)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:366 +0x1db
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc4205be000, 0xc4201080e0, 0xd9, 0xd9, 0x0, 0x0)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:337 +0x378
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc4205be000, 0x0, 0x0, 0x4341b8, 0x137e0b8, 0x0)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:166 +0xb4
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc4205be000)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:133 +0x2f
created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:108 +0x216
vdice commented 7 years ago

We are also seeing this in our long-running cluster, which happens to be:

k8s version 1.4.7 on GKE Workflow v2.9.1 Helm 2.1.3

Bregor commented 7 years ago

Yes, the problem appears in 2.9. 2.8 was not affected.

vdice commented 7 years ago

Thanks @Bregor that is helpful... So there are 2 builder releases in the meantime: https://github.com/deis/builder/releases/tag/v2.6.0 and https://github.com/deis/builder/releases/tag/v2.6.1

There aren't any changes that look obvious towards this behavior so wondering if the issue is something a bit more subtle such as the bump of the golang toolchain when we updated the docker-go-dev image in this project (https://github.com/deis/builder/commit/3fb031a34f3b21a936e5996696d1642a0b20788b)

vdice commented 7 years ago

No leads yet; going to build/deploy image locally with added debug logging to try to pinpoint...

bacongobbler commented 7 years ago

From what we're finding, the error is here.

Perhaps this Go crypto/ssh issue is relevant, which would sound fair because of the basic liveness check performed by the server, which is just a basic TCP probe.

EDIT: Then again, that fix was merged in June 2015 so that fix would've always been there because Go 1.5 was released in Aug 2015. Still, there could be something here that leads to a bigger fish.

bacongobbler commented 7 years ago

I think I'm onto something here. In server_test.go, if I add this unit test that runs the server with no SSH config to intentionally cause it to panic and run make test:

func TestReceiveWithNoConfig(t *testing.T) {
    const testingServerAddr = "127.0.0.1:2246"

    c := NewCircuit()
    pushLock := NewInMemoryRepositoryLock(0)
    // NOTE(bacongobbler): notice that we're not initializing the server with any SSH config.
    runServer(nil, c, pushLock, testingServerAddr, time.Duration(0), t)

    // Give server time to initialize.
    time.Sleep(200 * time.Millisecond)
    assert.Equal(t, c.State(), ClosedState, "circuit state")

    // Connect to the server and issue env var set. This should return true.
    _, err := ssh.Dial("tcp", testingServerAddr, clientConfig())
    if err != nil {
        t.Fatalf("Failed to connect client to local server: %s", err)
    }
}

I get the following panic:

---> Listening on 127.0.0.1:2246
---> Accepting new connections.
---> Accepted connection.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5e3747]

goroutine 1091 [running]:
panic(0x8a3860, 0xc42000c0a0)
        /usr/local/go/src/runtime/panic.go:500 +0x1ae
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.NewServerConn(0xaf3020, 0xc420484018, 0x0, 0x0, 0x0, 0x1, 0xc420280778, 0x4057da)

So wrong incantation, but at least we're in the right location to nailing down a reproducible test case against the OP's logs. From the panic stacktrace, it looks to be when initializing a basic handshake without having any valid config from the client. I think the DSAPrivateKey callout is a wild goose chase/default path.

bacongobbler commented 7 years ago

For what it's worth, I've been running a local vagrant cluster on k8s v1.5.1 and Workflow v2.10.0 for over 19 hours without any restarts:

><> kd get po | grep builder
deis-builder-3170955059-k084s            1/1       Running   0          19h

Still trying to nail it down through a unit test though.

Bregor commented 7 years ago

@bacongobbler try to build something. Especially something failing to build/start.

bacongobbler commented 7 years ago

Building works fine, failing or not. Tried that, but thanks for the suggestion!

From @vdice's observations it seems to be something passive, because neither of us are actively killing the pod when it reboots.

vdice commented 7 years ago

From @vdice's observations it seems to be something passive, because neither of us are actively killing the pod when it reboots.

Correct, thank you @bacongobbler ; I should have mentioned that before. My GKE cluster (k8s 1.4.7) has the latest Workflow chart installed and is completely stock. Without even interacting with Deis/Workflow, (have not even registered) the builder pod restarts occur, usually multiple times an hour. Here's the number currently:

NAME                                     READY     STATUS    RESTARTS   AGE
deis-builder-2492834593-bh17g            1/1       Running   6          4h
joelso commented 7 years ago

We are having the same issue on cluster running on AWS (kube-aws):

Deis Workflow 2.9.1 Kube 1.4.6 Helm 2.1.0

Bregor commented 7 years ago

Maybe this could be useful:

Events:

    deis      2017-01-16 12:42:50 +0300 MSK   2017-01-12 09:25:20 +0300 MSK   30        deis-builder-762421820-k5ht5   Pod       spec.containers{deis-builder}   Warning   Unhealthy   {kubelet 10.83.8.200}   Liveness probe failed: Get http://192.168.0.12:8092/healthz: EOF
    deis      2017-01-16 12:42:51 +0300 MSK   2017-01-12 08:01:16 +0300 MSK   142       deis-builder-762421820-k5ht5   Pod       spec.containers{deis-builder}   Normal    Pulled    {kubelet 10.83.8.200}   Container image "quay.io/deis/builder:v2.6.1" already present on machine
    deis      2017-01-16 12:42:52 +0300 MSK   2017-01-16 12:42:52 +0300 MSK   1         deis-builder-762421820-k5ht5   Pod       spec.containers{deis-builder}   Normal    Created   {kubelet 10.83.8.200}   Created container with docker id 2a3c05004c81; Security:[seccomp=unconfined]
    deis      2017-01-16 12:42:53 +0300 MSK   2017-01-16 12:42:53 +0300 MSK   1         deis-builder-762421820-k5ht5   Pod       spec.containers{deis-builder}   Normal    Started   {kubelet 10.83.8.200}   Started container with docker id 2a3c05004c81

Pods:

    deis      deis-builder-762421820-k5ht5   0/1       Error     140      4d        192.168.0.12   10.83.8.200
    deis      deis-builder-762421820-k5ht5   0/1       Running   141       4d        192.168.0.12   10.83.8.200
    deis      deis-builder-762421820-k5ht5   1/1       Running   141       4d        192.168.0.12   10.83.8.200

Controller log:

    INFO [preview]: build preview-ebaf46d created
    INFO [preview]: gazay deployed d94cc6b
    INFO [preview]: This deployments overall timeout is 120s - batch timout is 120s and there are 1 batches to deploy with a total of 2 pods
    192.168.0.12 "POST /v2/hooks/build/ HTTP/1.1" 200 25 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/04:9b:f5:de:de:27:10:5a:b8:8d:ab:79:cd:17:e8:57 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/27:c6:ad:f9:a6:4d:22:3f:18:b0:3b:df:81:1c:57:45 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/dd:3b:b8:2e:85:04:06:e9:ab:ff:a8:0a:c0:04:6e:d6 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/71:3a:b0:18:e2:6c:41:18:4e:56:1e:fd:d2:49:97:66 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/b0:90:e5:dd:bf:1c:91:c0:83:dc:09:3a:53:7c:5d:07 HTTP/1.1" 404 33 "deis-builder"
    192.168.0.12 "GET /v2/hooks/key/22:c8:a9:c3:01:a0:17:31:a5:43:f2:70:4a:1c:55:f6 HTTP/1.1" 404 33 "deis-builder"

May this 404s or deployment overall timeout be a purpose for failing builder's liveness-check?

Bregor commented 7 years ago

Some addition info:

2017-01-17T11:11:58.616956833Z ERROR [ebay-mag-2]: (app::deploy): There was a problem while deploying v115 of ebay-mag-2-web. Additional information:
2017-01-17T11:11:58.617114079Z Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.617142643Z ERROR [ebay-mag-2]: (app::deploy): (app::deploy): There was a problem while deploying v115 of ebay-mag-2-web. Additional information:
2017-01-17T11:11:58.617157129Z Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.697508351Z ERROR:root:(app::deploy): (app::deploy): There was a problem while deploying v115 of ebay-mag-2-web. Additional information:
2017-01-17T11:11:58.697593251Z Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.697608908Z Traceback (most recent call last):
2017-01-17T11:11:58.697621238Z   File "/app/scheduler/__init__.py", line 270, in deploy
2017-01-17T11:11:58.697633205Z     namespace, name, image, entrypoint, command, **kwargs
2017-01-17T11:11:58.697644665Z   File "/app/scheduler/resources/deployment.py", line 138, in update
2017-01-17T11:11:58.697656298Z     self.wait_until_ready(namespace, name, **kwargs)
2017-01-17T11:11:58.697667225Z   File "/app/scheduler/resources/deployment.py", line 353, in wait_until_ready
2017-01-17T11:11:58.697684935Z     self.pod._handle_not_ready_pods(namespace, labels)
2017-01-17T11:11:58.697765951Z   File "/app/scheduler/resources/pod.py", line 720, in _handle_not_ready_pods
2017-01-17T11:11:58.697781435Z     raise KubeException(message)
2017-01-17T11:11:58.697795505Z scheduler.exceptions.KubeException: Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.697808171Z
2017-01-17T11:11:58.697818315Z The above exception was the direct cause of the following exception:
2017-01-17T11:11:58.697828868Z
2017-01-17T11:11:58.697841211Z Traceback (most recent call last):
2017-01-17T11:11:58.697860075Z   File "/app/api/models/app.py", line 493, in deploy
2017-01-17T11:11:58.697880775Z     async_run(tasks)
2017-01-17T11:11:58.697901071Z   File "/app/api/utils.py", line 169, in async_run
2017-01-17T11:11:58.697927378Z     raise error
2017-01-17T11:11:58.697939798Z   File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
2017-01-17T11:11:58.697951311Z     result = coro.throw(exc)
2017-01-17T11:11:58.697961688Z   File "/app/api/utils.py", line 182, in async_task
2017-01-17T11:11:58.697972558Z     yield from loop.run_in_executor(None, params)
2017-01-17T11:11:58.697983155Z   File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
2017-01-17T11:11:58.697993875Z     yield self  # This tells Task to wait for completion.
2017-01-17T11:11:58.698005745Z   File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
2017-01-17T11:11:58.698026335Z     future.result()
2017-01-17T11:11:58.698046131Z   File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
2017-01-17T11:11:58.698064605Z     raise self._exception
2017-01-17T11:11:58.698075908Z   File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
2017-01-17T11:11:58.698086948Z     result = self.fn(*self.args, **self.kwargs)
2017-01-17T11:11:58.698097535Z   File "/app/scheduler/__init__.py", line 276, in deploy
2017-01-17T11:11:58.698108305Z     ) from e
2017-01-17T11:11:58.698118831Z scheduler.exceptions.KubeException: There was a problem while deploying v115 of ebay-mag-2-web. Additional information:
2017-01-17T11:11:58.698129888Z Readiness probe failed: Get http://192.168.48.12:5000/health: dial tcp 192.168.48.12:5000: getsockopt: connection refused
2017-01-17T11:11:58.698141051Z
2017-01-17T11:11:58.698150915Z The above exception was the direct cause of the following exception:
2017-01-17T11:11:58.698161851Z
2017-01-17T11:11:58.698172055Z Traceback (most recent call last):
2017-01-17T11:11:58.698182398Z   File "/app/api/models/app.py", line 502, in deploy
2017-01-17T11:11:58.698195955Z     self.deploy(release.previous(), force_deploy=True, rollback_on_failure=False)
2017-01-17T11:11:58.698235185Z   File "/app/api/models/app.py", line 512, in deploy
2017-01-17T11:11:58.698248875Z     raise ServiceUnavailable(err) from e

Controller fails from time to time. Builder's liveness probe is controller-related.

Bregor commented 7 years ago

Just after builder restart:

$ git push deis master

2017-01-17T14:20:03.432674195Z Accepted connection.
2017-01-17T14:20:03.944641435Z Accepted connection.
2017-01-17T14:20:04.363952863Z ---> ---> ---> ---> panic: runtime error: slice bounds out of range
2017-01-17T14:20:04.364017975Z
2017-01-17T14:20:04.364028347Z goroutine 259 [running]:
2017-01-17T14:20:04.364035784Z panic(0x1170b00, 0xc4200180c0)
2017-01-17T14:20:04.364044873Z  /usr/local/go/src/runtime/panic.go:500 +0x1a1
2017-01-17T14:20:04.364053685Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc420358108, 0x1bbf2e0, 0xc4200155c0, 0xc42012da40, 0x14, 0x20, 0x0, 0x780, 0xc42012da40)
2017-01-17T14:20:04.364062346Z  /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:379 +0x3bf
2017-01-17T14:20:04.364070577Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1bc8860, 0xc420358108, 0x1bbf2e0, 0xc4200155c0, 0xc42012da40, 0x14, 0x20, 0x11030a0, 0x10caae3, 0x11a4180, ...)
2017-01-17T14:20:04.364077775Z  /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:157 +0x6d
2017-01-17T14:20:04.364086132Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc420106c20, 0x7f0b30777cd8, 0xc420366ea0, 0x1bbf2e0, 0xc4200155c0, 0xc4205b8720, 0x1bc8860, 0xc420358108, 0x40def4, 0x12f8db4, ...)
2017-01-17T14:20:04.364093352Z  /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:172 +0x4fb
2017-01-17T14:20:04.364100217Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc420612c30, 0x1bc8820, 0xc420106c20, 0xc42016a600, 0xc4205b8720, 0x1, 0x0, 0xe3)
2017-01-17T14:20:04.364107165Z  /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:366 +0x1db
2017-01-17T14:20:04.364115152Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc420612c30, 0xc4202d0000, 0xd9, 0xd9, 0x0, 0x0)
2017-01-17T14:20:04.364122009Z  /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:337 +0x378
2017-01-17T14:20:04.364128869Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc420612c30, 0x0, 0x0, 0x0, 0x0, 0x45c4c1)
2017-01-17T14:20:04.364135778Z  /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:166 +0xb4
2017-01-17T14:20:04.364142772Z github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc420612c30)
2017-01-17T14:20:04.364149315Z  /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:133 +0x2f
2017-01-17T14:20:04.364187652Z created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
2017-01-17T14:20:04.364196565Z  /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:108 +0x216
Bregor commented 7 years ago

Controller logs are empty at the moment, so this particular behaviour is just builder's

Bregor commented 7 years ago

Controller issue is here: https://github.com/deis/controller/issues/1204

vdice commented 7 years ago

Another interesting detail: Over the weekend, I checked out the v2.5.5 tag of builder, which is the version released with Workflow v2.8.0. These versions are believed to be the last not hitting this issue. With the builder deployment updated to use this image, there indeed are instances of this issue, though seemingly more rare:

deis-builder-1590797075-a5w9f            1/1       Running   33         3d

Error:

$ kd logs -p deis-builder-1590797075-a5w9f
2017/01/17 12:20:21 Running in debug mode
2017/01/17 12:20:21 Starting health check server on port 8092
2017/01/17 12:20:21 Starting deleted app cleaner
2017/01/17 12:20:21 Starting SSH server on 0.0.0.0:2223
Parsed host key /var/run/secrets/deis/builder/ssh/ssh-host-rsa-key.
Parsed host key /var/run/secrets/deis/builder/ssh/ssh-host-dsa-key.
Parsed host key /var/run/secrets/deis/builder/ssh/ssh-host-ecdsa-key.
Listening on 0.0.0.0:2223
Accepting new connections.
Accepted connection.
[DEBUG] [DEBUG] [DEBUG] ---> ---> ---> [ERROR] Failed handshake: ssh: invalid packet length, packet too large
Accepted connection.
---> panic: runtime error: slice bounds out of range

goroutine 12604 [running]:
panic(0x1170100, 0xc4200160c0)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc420026a98, 0x1bbe4e0, 0xc4200137a0, 0xc420130b60, 0x14, 0x20, 0x0, 0x780, 0xc420130b60)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:379 +0x3bf
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1bc7aa0, 0xc420026a98, 0x1bbe4e0, 0xc4200137a0, 0xc420130b60, 0x14, 0x20, 0x1102640, 0x10ca083, 0x11a3800, ...)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:157 +0x6d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc4200ecc40, 0x7f4d821ee108, 0xc4204659e0, 0x1bbe4e0, 0xc4200137a0, 0xc420122420, 0x1bc7aa0, 0xc420026a98, 0x40def4, 0x12f84a8, ...)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:172 +0x4fb
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc4201174a0, 0x1bc7a60, 0xc4200ecc40, 0xc420118700, 0xc420122420, 0x1, 0x0, 0xe3)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:366 +0x1db
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc4201174a0, 0xc4200dc1c0, 0xd9, 0xd9, 0x0, 0x0)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:337 +0x378
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc4201174a0, 0x0, 0x0, 0xc42048c5a0, 0xc42001f7a8, 0x95b892)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:166 +0xb4
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc4201174a0)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:133 +0x2f
created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:108 +0x216
lachie83 commented 7 years ago

+1 on this issue. Also witnessing the same behavior on k8s 1.4.7 builder image -- quay.io/deis/builder:v2.6.1

bacongobbler commented 7 years ago

@gabrtv pointed me to this commit, made only 15 hours ago: https://github.com/golang/crypto/commit/b82246307bd525fde15c1df976318003716bca68

I'll see if compiling builder with go-tip fixes this. If so, then we'll have to find a way to apply this patch to Go 1.7.4 in deis/docker-go-dev.

EDIT: golang.org/x/crypto is maintained separately from Go stdlib. myb

bacongobbler commented 7 years ago

If anyone's willing to try a fix, #464 looks promising.

bacongobbler commented 7 years ago

re-opening as #464 didn't resolve this.

stuszynski commented 7 years ago

I believe we encounter the same issue:

Environment:

Fail ratio: 8 times in last 24 hour, ~20 times a day avg from last 7days.

Log:

Accepted connection.
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> panic: runtime error: slice bounds out of range

goroutine 2651 [running]:
panic(0x1170b00, 0xc4200180c0)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc42015c350, 0x1bbf2e0, 0xc420015740, 0xc420393300, 0x14, 0x20, 0x0, 0x780, 0xc420393300)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:379 +0x3bf
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1bc8860, 0xc42015c350, 0x1bbf2e0, 0xc420015740, 0xc420393300, 0x14, 0x20, 0xc42044d8eb, 0x12e44b7, 0x8, ...)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:157 +0x6d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc4200f8c20, 0x7f18f3edd968, 0xc4202146c0, 0x1bbf2e0, 0xc420015740, 0xc4203d9080, 0x1bc8860, 0xc42015c350, 0x40def4, 0x12f8db4, ...)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:172 +0x4fb
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc42068a000, 0x1bc8820, 0xc4200f8c20, 0xc4201edb80, 0xc4203d9080, 0x1, 0x0, 0xe3)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:366 +0x1db
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc42068a000, 0xc4204a01c0, 0xd9, 0xd9, 0x0, 0x0)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:337 +0x378
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readOnePacket(0xc42068a000, 0x0, 0x0, 0xc4203d8240, 0x7096dd, 0x180001)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:166 +0xb4
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).readLoop(0xc42068a000)
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:133 +0x2f
created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
    /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:108 +0x216
monaka commented 7 years ago

I also got same on Workflow-2.11.0 / K8s v1.5.2+coreos.1 / Azure. Reproduced randomly.

Accepted connection.
Starting ssh authentication
Channel type: session

Received request of type auth-agent-req@openssh.com

Key='LC_ALL', Value=''

receiving git repo name: eclipse-che.git, operation: git-receive-pack, fingerprint: 54:0d:c3:3d:43:46:d8:1d:db:0e:ff:f5:02:1c:36:10, user: monaka
creating repo directory /home/git/eclipse-che.git
writing pre-receive hook under /home/git/eclipse-che.git
git-shell -c git-receive-pack 'eclipse-che.git'
Waiting for git-receive to run.
Waiting for deploy.
Accepted connection.
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> panic: runtime error: slice bounds out of range

goroutine 5062 [running]:
panic(0x11ac4a0, 0xc4200180c0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dsaPrivateKey).Sign(0xc420316470, 0x1c22840, 0xc4200156e0, 0xc420708e80, 0x14, 0x20, 0x0, 0x7c0, 0xc420708e80)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/keys.go:457 +0x3bf
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.signAndMarshal(0x1c2c000, 0xc420316470, 0x1c22840, 0xc4200156e0, 0xc420708e80, 0x14, 0x20, 0xc4205622eb, 0x132695d, 0x8, ...)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/server.go:159 +0x6d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*dhGroup).Server(0xc4200d1720, 0x7f1064176b80, 0xc42037c200, 0x1c22840, 0xc4200156e0, 0xc4201704e0, 0x1c2c000, 0xc420316470, 0x40df34, 0x133b4ad, ...)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/kex.go:188 +0x511
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).server(0xc420540a00, 0x1c2bfc0, 0xc4200d1720, 0xc42009c380, 0xc4201704e0, 0x1, 0x0, 0x2)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:537 +0x1db
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc420540a00, 0xc4204c41c0, 0xd9, 0xd9, 0xc4203e3e01, 0xc4206e77a0)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:502 +0x37d
github.com/deis/builder/vendor/golang.org/x/crypto/ssh.(*handshakeTransport).kexLoop(0xc420540a00)
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:271 +0x1e9
created by github.com/deis/builder/vendor/golang.org/x/crypto/ssh.newServerTransport
        /go/src/github.com/deis/builder/vendor/golang.org/x/crypto/ssh/handshake.go:121 +0x227
vdice commented 7 years ago

Alas, it doesn't appear that any golang.org/x/crypto bumps have completely ameliorated this issue; bumping to next milestone.

yaroslav commented 7 years ago

Can reproduce as well; a huge problem for us at the moment 😞

vdice commented 7 years ago

Closed by https://github.com/deis/builder/pull/493

Included in Builder v2.9.0 release. You can update an existing install/helm release via

helm upgrade <release> workflow/workflow --set builder.docker_tag=v2.9.0 (-f values.yaml)

Note you may want to add --version <particular Workflow version> if necessary.

Or just patch the builder deploy via

kubectl -n deis set image deployment deis-builder deis-builder=quay.io/deis/builder:v2.9.0
felixbuenemann commented 7 years ago

I think you also need to upgrade to dockerbuilder v2.7.1 or things will break.

bacongobbler commented 7 years ago

The helm upgrade assumes v2.12.0 which has the latest version of dockerbuilder so yes.

felixbuenemann commented 7 years ago

Btw. I just did some debugging in golang.org/x/crypto/ssh and it appears the above mentioned bug occurs with 2048 Bit DSA keys, while the common key size for DSA is 1024 Bit. The error seems to be that the code hardcodes a subgroup size of 160 Bit (20 Bytes), but DSA 2048 Bit uses either 224 Bit or 256 Bit subgroups (my test key generated with openssl had a subgroup size of 256 Bit / 32 Byte).

I'll open an upstream issue and reference it back here.

Update: Looks like @vdice already reported the problem in golang/go#19424.

Based on the above it's safe to assume that the crashes happened whenever someone with a DSA key of 2048 Bit or more connected to the builder (which could also be bots/scanners and not actual users).

Update: It appears the root of the problem stems from the genPrivateKey template function of sprig which is used in the deis builder helm templates to generate the host keys and uses dsa.L2048N256 to generate a 2048 Bit DSA key with a 256 Bit subgroup instead of dsa.L1024N160 for a 1024 Bit key with a 160 Bit subgroup, which ist the only key size allowed by ssh-keygen -t dsa and the openssh server.

bacongobbler commented 7 years ago

We completely removed DSA keys from the builder some time ago as they're being discontinued, so this effectively has been fixed.

Bregor commented 7 years ago

@bacongobbler sure, this should be closed. Thanks for reminding.