d2iq-archive / kubernetes-mesos

A Kubernetes Framework for Apache Mesos

Oinker-Go scheduling frequently fails with ExitCode:2 #626

Open karlkfi opened 8 years ago

karlkfi commented 8 years ago

Deploy k8s, etcd (may not be needed to reproduce), and Cassandra:

dcos config prepend package.sources https://github.com/mesosphere/multiverse/archive/version-1.x.zip
dcos package update --validate
dcos package install cassandra
dcos package install etcd
cat >/tmp/options.json <<EOF
{
  "kubernetes": {
    "etcd-mesos-framework-name": "etcd"
  }
}
EOF
dcos package install --options=/tmp/options.json kubernetes
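
Optionally, confirm that the Kubernetes framework is up and reachable through the CLI before deploying anything on it:

$ dcos kubectl get nodes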

Deploy OinkerGo:

dcos kubectl create -f ~/go/src/github.com/karlkfi/oinker-go/kubernetes.yaml

Check the status of the pods:

$ dcos kubectl get pod -l=app=oinker
NAME              READY     STATUS       RESTARTS   AGE
oinker-go-2qpwc   0/1       Pending      0          14s
oinker-go-75cjf   0/1       Pending      0          14s
oinker-go-7url5   0/1       Pending      0          14s
oinker-go-7xanx   0/1       Pending      0          14s
oinker-go-8ur4h   0/1       Pending      0          14s
oinker-go-dal6r   0/1       Pending      0          14s
oinker-go-eu2xw   0/1       Pending      0          14s
oinker-go-kk9yr   1/1       Running      0          43s
oinker-go-m5yc8   0/1       Running      0          14s
oinker-go-rxflp   0/1       Running      0          14s
oinker-go-sukc1   0/1       Running      0          14s
oinker-go-tei77   0/1       Running      0          14s
oinker-go-uyged   1/1       Running      0          43s
oinker-go-v9r41   0/1       Running      0          14s
oinker-go-vyu2h   0/1       ExitCode:2   1          14s
oinker-go-ygmxz   0/1       Pending      0          14s
oinker-go-yni2u   1/1       Running      0          43s
oinker-go-zfb72   0/1       Pending      0          14s

Wait a few seconds and try again:

$ dcos kubectl get pod -l=app=oinker
NAME              READY     STATUS       RESTARTS   AGE
oinker-go-2qpwc   0/1       Pending      0          29s
oinker-go-75cjf   1/1       Running      0          29s
oinker-go-7url5   0/1       Running      1          29s
oinker-go-7xanx   0/1       Running      0          29s
oinker-go-8ur4h   0/1       ExitCode:2   1          29s
oinker-go-dal6r   0/1       ExitCode:2   0          29s
oinker-go-eu2xw   0/1       Running      0          29s
oinker-go-kk9yr   1/1       Running      0          58s
oinker-go-m5yc8   0/1       ExitCode:2   2          29s
oinker-go-rxflp   0/1       ExitCode:2   2          29s
oinker-go-sukc1   0/1       ExitCode:2   2          29s
oinker-go-tei77   0/1       ExitCode:2   1          29s
oinker-go-uyged   1/1       Running      0          58s
oinker-go-v9r41   1/1       Running      0          29s
oinker-go-vyu2h   0/1       Running      3          29s
oinker-go-ygmxz   0/1       Running      0          29s
oinker-go-yni2u   1/1       Running      0          58s
oinker-go-zfb72   0/1       ExitCode:2   1          29s

Events on one of the failing pods:

Events:
  FirstSeen             LastSeen            Count   From            SubobjectPath               Reason          Message
  Tue, 17 Nov 2015 10:34:13 -0800   Tue, 17 Nov 2015 10:34:15 -0800 3   {scheduler }                            failedScheduling    Error scheduling: No suitable offers for pod/task

Eventually all of the pods get scheduled and run, but it can take anywhere from 0 to 12 restarts.

$ dcos kubectl get pod -l=app=oinker
NAME              READY     STATUS    RESTARTS   AGE
oinker-go-2qpwc   1/1       Running   2          9m
oinker-go-75cjf   1/1       Running   0          9m
oinker-go-7url5   1/1       Running   2          9m
oinker-go-7xanx   1/1       Running   2          9m
oinker-go-8ur4h   1/1       Running   2          9m
oinker-go-dal6r   1/1       Running   2          9m
oinker-go-eu2xw   1/1       Running   2          9m
oinker-go-kk9yr   1/1       Running   0          9m
oinker-go-m5yc8   1/1       Running   2          9m
oinker-go-rxflp   1/1       Running   2          9m
oinker-go-sukc1   1/1       Running   2          9m
oinker-go-tei77   1/1       Running   2          9m
oinker-go-uyged   1/1       Running   0          9m
oinker-go-v9r41   1/1       Running   0          9m
oinker-go-vyu2h   0/1       API error (500): Cannot start container 3b838db332dbb27dc67758126f6f77251ecafb42d8d7015d36a3e9e33ceeee20: cannot join network of a non running container: 8b3bdc9ad26f89df56f78df1a2af8b2736058c407705504832243834c8c5d854   8         9m
oinker-go-ygmxz   1/1       Running   2         9m
oinker-go-yni2u   1/1       Running   0         9m
oinker-go-zfb72   1/1       Running   2         9m

(I think this "cannot join network of a non running container" error is a different bug.)

sttts commented 8 years ago

Isn't this "Exitcode:2" just your binary terminating with 2?

karlkfi commented 8 years ago

It should be, but it's not. See the event: Error scheduling: No suitable offers for pod/task.

sttts commented 8 years ago

That message is normal when a pod comes in before a suitable offer is available. It's not nice, but it has no influence on scheduling. I believe the pod has actually been started by the time you see "Exitcode:2".
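
One way to tell the two cases apart is to look at the container state on the pod itself rather than at the scheduler events; an exit code reported under the container's last state means the container was started and then terminated (the pod name below is just one of the failing ones from the list above):

$ dcos kubectl describe pod oinker-go-vyu2h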

sttts commented 8 years ago

I tried again without a working Cassandra. Pods come up and terminate, again and again, each time exiting with code 2:

lastState:
      terminated:
        containerID: docker://009ca04414623507c155cfb88f1d560620d985641f797ccbb6dfa389c4b23a8a
        exitCode: 2
        finishedAt: 2015-11-23T15:53:07Z
        startedAt: 2015-11-23T15:53:07Z

The symptoms have nothing to do with scheduling. The task itself fails (possibly because of other issues: DNS? A broken Cassandra connection?)
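
If DNS or the Cassandra connection is the suspect, a quick check from inside one of the running pods can narrow it down (the service hostname is a placeholder for whatever address the app is configured with, and this assumes the image ships nslookup):

$ dcos kubectl exec oinker-go-kk9yr -- nslookup cassandra-dcos-node.cassandra.dcos.mesos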

sttts commented 8 years ago

Here is a log of one of the failing containers:

core@ip-10-0-2-61 ~ $ docker logs 1d96a7742b0a
time="2015-11-23T15:54:47Z" level=info msg="Flags: &{address:0xc20802ab30 cassandraAddr:0xc20802aaf0 cassandraRepl:0xc20802ab10}"
time="2015-11-23T15:54:47Z" level=info msg="Creating cql session"
time="2015-11-23T15:54:47Z" level=error msg="Creating CQL Session: no connections were made when creating the session"
time="2015-11-23T15:54:47Z" level=info msg="Creating keyspace oinker (replication_factor: 3)"
time="2015-11-23T15:54:47Z" level=error msg="Cannot finalize cql session, not initialized"
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4badfd]

goroutine 1 [running]:
github.com/gocql/gocql.(*policyConnPool).Pick(0x0, 0xc2080348f0, 0x3)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:175 +0x13d
github.com/gocql/gocql.(*Session).executeQuery(0xc208060000, 0xc2080348f0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/session.go:221 +0xf8
github.com/gocql/gocql.(*Query).Iter(0xc2080348f0, 0xc2080348f0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/session.go:707 +0xce
github.com/gocql/gocql.(*Query).Exec(0xc2080348f0, 0x0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/session.go:689 +0x3a
github.com/karlkfi/oinker-go/service.(*CQLOinkRepo).Initialize(0xc208084360)
    /go/src/github.com/karlkfi/oinker-go/service/cql_oink_repo.go:45 +0x3a7
github.com/karlkfi/inject.(*definition).Resolve(0xc20800bf20, 0x7f4071e8b578, 0xc208032040, 0x0, 0x0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/karlkfi/inject/definition.go:52 +0x1b6
github.com/karlkfi/inject.(*graph).ResolveByAssignableType(0xc208032040, 0x7f4071e86b28, 0x7d25c0, 0x0, 0x0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/karlkfi/inject/graph.go:81 +0x1d9
github.com/karlkfi/inject.ExtractAssignable(0x7f4071e8b578, 0xc208032040, 0x703ba0, 0xc20802ab40, 0x0, 0x0, 0x0)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/karlkfi/inject/extract.go:39 +0x27e
main.main()
    /go/src/github.com/karlkfi/oinker-go/main.go:32 +0x2ab

goroutine 5 [syscall]:
os/signal.loop()
    /usr/lib/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /usr/lib/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 8 [runnable]:
github.com/gocql/gocql.(*hostConnPool).fillingStopped(0xc20805e000)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:384
created by github.com/gocql/gocql.(*hostConnPool).fill
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:338 +0x226

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 9 [runnable]:
github.com/gocql/gocql.(*hostConnPool).drain(0xc20805e000)
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:457
created by github.com/gocql/gocql.(*hostConnPool).Close
    /go/src/github.com/karlkfi/oinker-go/Godeps/_workspace/src/github.com/gocql/gocql/connectionpool.go:286 +0xa0
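
The trace also explains the exit code: session creation fails ("no connections were made"), the error is only logged, and initialization carries on; the keyspace query then dereferences the session's nil connection pool, the process panics, and a Go panic exits with status 2, which is what Docker reports. A minimal Go sketch of that failure mode and the guard that avoids it (the cluster address and keyspace statement here are illustrative assumptions, not the actual oinker-go code):

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical Cassandra address; oinker-go takes this from a flag.
	cluster := gocql.NewCluster("cassandra.example.com")

	session, err := cluster.CreateSession()
	if err != nil {
		// The log above shows this error being recorded and execution
		// continuing; the later Query(...).Exec() then hits a nil
		// connection pool and panics. Failing fast (or retrying until
		// Cassandra is reachable) avoids the panic.
		log.Fatalf("creating CQL session: %v", err)
	}
	defer session.Close()

	// Only reached with a usable session.
	if err := session.Query(`CREATE KEYSPACE IF NOT EXISTS oinker
		WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}`).Exec(); err != nil {
		log.Fatalf("creating keyspace oinker: %v", err)
	}
}
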
sttts commented 8 years ago

The bad scheduling may actually be related to https://github.com/mesosphere/kubernetes-mesos/issues/636, which is now fixed in the latest 0.7 RC packages.

Removing from the milestone for the moment and reassigning to @karlkfi for further tracking.

karlkfi commented 8 years ago

Downgrading priority, since it's just an Oinker bug.