d2iq-archive / kubernetes-mesos

A Kubernetes Framework for Apache Mesos

Oinker-Go scheduling frequently fails with ExitCode:2 #626

Open karlkfi opened 8 years ago

karlkfi commented 8 years ago

Deploy k8s, etcd (may not be needed to reproduce), and Cassandra:

dcos config prepend package.sources https://github.com/mesosphere/multiverse/archive/version-1.x.zip
dcos package update --validate
dcos package install cassandra
dcos package install etcd
cat >/tmp/options.json <<EOF
{
  "kubernetes": {
    "etcd-mesos-framework-name": "etcd"
  }
}
EOF
dcos package install --options=/tmp/options.json kubernetes
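
Optionally, confirm that the Kubernetes framework is up and reachable through the CLI before deploying anything on it:

$ dcos kubectl get nodes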

Deploy OinkerGo:

dcos kubectl create -f ~/go/src/github.com/karlkfi/oinker-go/kubernetes.yaml

Check the status of the pods:

$ dcos kubectl get pod -l=app=oinker
NAME              READY     STATUS       RESTARTS   AGE
oinker-go-2qpwc   0/1       Pending      0          14s
oinker-go-75cjf   0/1       Pending      0          14s
oinker-go-7url5   0/1       Pending      0          14s
oinker-go-7xanx   0/1       Pending      0          14s
oinker-go-8ur4h   0/1       Pending      0          14s
oinker-go-dal6r   0/1       Pending      0          14s
oinker-go-eu2xw   0/1       Pending      0          14s
oinker-go-kk9yr   1/1       Running      0          43s
oinker-go-m5yc8   0/1       Running      0          14s
oinker-go-rxflp   0/1       Running      0          14s
oinker-go-sukc1   0/1       Running      0          14s
oinker-go-tei77   0/1       Running      0          14s
oinker-go-uyged   1/1       Running      0          43s
oinker-go-v9r41   0/1       Running      0          14s
oinker-go-vyu2h   0/1       ExitCode:2   1          14s
oinker-go-ygmxz   0/1       Pending      0          14s
oinker-go-yni2u   1/1       Running      0          43s
oinker-go-zfb72   0/1       Pending      0          14s

Wait a few seconds and try again:

$ dcos kubectl get pod -l=app=oinker
NAME              READY     STATUS       RESTARTS   AGE
oinker-go-2qpwc   0/1       Pending      0          29s
oinker-go-75cjf   1/1       Running      0          29s
oinker-go-7url5   0/1       Running      1          29s
oinker-go-7xanx   0/1       Running      0          29s
oinker-go-8ur4h   0/1       ExitCode:2   1          29s
oinker-go-dal6r   0/1       ExitCode:2   0          29s
oinker-go-eu2xw   0/1       Running      0          29s
oinker-go-kk9yr   1/1       Running      0          58s
oinker-go-m5yc8   0/1       ExitCode:2   2          29s
oinker-go-rxflp   0/1       ExitCode:2   2          29s
oinker-go-sukc1   0/1       ExitCode:2   2          29s
oinker-go-tei77   0/1       ExitCode:2   1          29s
oinker-go-uyged   1/1       Running      0          58s
oinker-go-v9r41   1/1       Running      0          29s
oinker-go-vyu2h   0/1       Running      3          29s
oinker-go-ygmxz   0/1       Running      0          29s
oinker-go-yni2u   1/1       Running      0          58s
oinker-go-zfb72   0/1       ExitCode:2   1          29s

Events on one of the failing pods:

Events:
  FirstSeen             LastSeen            Count   From            SubobjectPath               Reason          Message
  Tue, 17 Nov 2015 10:34:13 -0800   Tue, 17 Nov 2015 10:34:15 -0800 3   {scheduler }                            failedScheduling    Error scheduling: No suitable offers for pod/task

Eventually all of the pods get scheduled and run, but it can take anywhere from 0 to 12 restarts.

$ dcos kubectl get pod -l=app=oinker
NAME              READY     STATUS    RESTARTS   AGE
oinker-go-2qpwc   1/1       Running   2          9m
oinker-go-75cjf   1/1       Running   0          9m
oinker-go-7url5   1/1       Running   2          9m
oinker-go-7xanx   1/1       Running   2          9m
oinker-go-8ur4h   1/1       Running   2          9m
oinker-go-dal6r   1/1       Running   2          9m
oinker-go-eu2xw   1/1       Running   2          9m
oinker-go-kk9yr   1/1       Running   0          9m
oinker-go-m5yc8   1/1       Running   2          9m
oinker-go-rxflp   1/1       Running   2          9m
oinker-go-sukc1   1/1       Running   2          9m
oinker-go-tei77   1/1       Running   2          9m
oinker-go-uyged   1/1       Running   0          9m
oinker-go-v9r41   1/1       Running   0          9m
oinker-go-vyu2h   0/1       API error (500): Cannot start container 3b838db332dbb27dc67758126f6f77251ecafb42d8d7015d36a3e9e33ceeee20: cannot join network of a non running container: 8b3bdc9ad26f89df56f78df1a2af8b2736058c407705504832243834c8c5d854   8         9m
oinker-go-ygmxz   1/1       Running   2         9m
oinker-go-yni2u   1/1       Running   0         9m
oinker-go-zfb72   1/1       Running   2         9m

(I think this "cannot join network of a non running container" error is a different bug.)

sttts commented 8 years ago

Isn't this "Exitcode:2" just your binary terminating with 2?

karlkfi commented 8 years ago

It should be, but it's not. See the event: Error scheduling: No suitable offers for pod/task.

sttts commented 8 years ago

That message is normal when a pod comes in before a suitable offer is available. It's not nice, but it has no influence on scheduling. I believe the pod has actually been started by the time you see "Exitcode:2".
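
One way to tell the two cases apart is to look at the container state on the pod itself rather than at the scheduler events; an exit code reported under the container's last state means the container was started and then terminated (the pod name below is just one of the failing ones from the list above):

$ dcos kubectl describe pod oinker-go-vyu2h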

sttts commented 8 years ago

I tried again without a working Cassandra. Pods come up and terminate, again and again, each time exiting with code 2:

lastState:
      terminated:
        containerID: docker://009ca04414623507c155cfb88f1d560620d985641f797ccbb6dfa389c4b23a8a
        exitCode: 2
        finishedAt: 2015-11-23T15:53:07Z
        startedAt: 2015-11-23T15:53:07Z

The symptoms have nothing to do with scheduling. The task itself fails (possibly because of other issues: DNS? A broken Cassandra connection?)
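
If DNS or the Cassandra connection is the suspect, a quick check from inside one of the running pods can narrow it down (the service hostname is a placeholder for whatever address the app is configured with, and this assumes the image ships nslookup):

$ dcos kubectl exec oinker-go-kk9yr -- nslookup cassandra-dcos-node.cassandra.dcos.mesos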

sttts commented 8 years ago

Here is a log of one of the failing containers:

core@ip-10-0-2-61 ~ $ docker logs 1d96a7742b0a
time="2015-11-23T15:54:47Z" level=info msg="Flags: &{address:0xc20802ab30 cassandraAddr:0xc20802aaf0 cassandraRepl:0xc20802ab10}"
time="2015-11-23T15:54:47Z" level=info msg="Creating cql session"
time="2015-11-23T15:54:47Z" level=error msg="Creating CQL Session: no connections were made when creating the session"
time="2015-11-23T15:54:47Z" level=info msg="Creating keyspace oinker (replication_factor: 3)"
time="2015-11-23T15:54:47Z" level=error msg="Cannot finalize cql session, not initialized"
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4badfd]

goroutine 1 [running]:
github.com/gocql/gocql.(*policyConnPool).Pick(0x0, 0xc2080348f0, 0x3)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:175 +0x13d
github.com/gocql/gocql.(*Session).executeQuery(0xc208060000, 0xc2080348f0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/session.go:221 +0xf8
github.com/gocql/gocql.(*Query).Iter(0xc2080348f0, 0xc2080348f0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/session.go:707 +0xce
github.com/gocql/gocql.(*Query).Exec(0xc2080348f0, 0x0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/session.go:689 +0x3a
github.com/karlkfi/oinker-go/service.(*CQLOinkRepo).Initialize(0xc208084360)
    /go/src/github.com/karlkfi/oinker-go/service/cql_oink_repo.go:45 +0x3a7
github.com/karlkfi/inject.(*definition).Resolve(0xc20800bf20, 0x7f4071e8b578, 0xc208032040, 0x0, 0x0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/karlkfi/inject/definition.go:52 +0x1b6
github.com/karlkfi/inject.(*graph).ResolveByAssignableType(0xc208032040, 0x7f4071e86b28, 0x7d25c0, 0x0, 0x0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/karlkfi/inject/graph.go:81 +0x1d9
github.com/karlkfi/inject.ExtractAssignable(0x7f4071e8b578, 0xc208032040, 0x703ba0, 0xc20802ab40, 0x0, 0x0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/karlkfi/inject/extract.go:39 +0x27e
main.main()
    /go/src/github.com/karlkfi/oinker-go/main.go:32 +0x2ab

goroutine 5 [syscall]:
os/signal.loop()
    /usr/lib/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /usr/lib/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 8 [runnable]:
github.com/gocql/gocql.(*hostConnPool).fillingStopped(0xc20805e000)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:384
created by github.com/gocql/gocql.(*hostConnPool).fill
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:338 +0x226

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 9 [runnable]:
github.com/gocql/gocql.(*hostConnPool).drain(0xc20805e000)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:457
created by github.com/gocql/gocql.(*hostConnPool).Close
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:286 +0xa0
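
The trace also explains the exit code: session creation fails ("no connections were made"), the error is only logged, and initialization carries on; the keyspace query then dereferences the session's nil connection pool, the process panics, and a Go panic exits with status 2, which is what Docker reports. A minimal Go sketch of that failure mode and the guard that avoids it (the cluster address and keyspace statement here are illustrative assumptions, not the actual oinker-go code):

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical Cassandra address; oinker-go takes this from a flag.
	cluster := gocql.NewCluster("cassandra.example.com")

	session, err := cluster.CreateSession()
	if err != nil {
		// The log above shows this error being recorded and execution
		// continuing; the later Query(...).Exec() then hits a nil
		// connection pool and panics. Failing fast (or retrying until
		// Cassandra is reachable) avoids the panic.
		log.Fatalf("creating CQL session: %v", err)
	}
	defer session.Close()

	// Only reached with a usable session.
	if err := session.Query(`CREATE KEYSPACE IF NOT EXISTS oinker
		WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}`).Exec(); err != nil {
		log.Fatalf("creating keyspace oinker: %v", err)
	}
}
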
sttts commented 8 years ago

The bad scheduling may actually be related to https://github.com/mesosphere/kubernetes-mesos/issues/636, which is now fixed in the latest 0.7 RC packages.

Removing from the milestone for the moment and reassigning to @karlkfi for further tracking.

karlkfi commented 8 years ago

Downgrading priority, since it's just an Oinker bug.