deis / workflow

The open source PaaS for Kubernetes.
https://deis.com/workflow/
MIT License
1.31k stars, 181 forks

Fluentd pod crashing on Azure Container Service #847

Open: sbulman opened 7 years ago

sbulman commented 7 years ago

Hi All,

I'm following the instructions to set up Deis on Azure Container Service. One of the deis-logger-fluentd pods is crashing with the following log.

2017-08-05 07:21:26 +0000 [info]: reading config file path="/opt/fluentd/conf/fluentd.conf"
2017-08-05 07:22:27 +0000 [error]: config error file="/opt/fluentd/conf/fluentd.conf" error_class=Fluent::ConfigError error="Invalid Kubernetes API v1 endpoint https://10.0.0.1:443: Timed out connecting to server"

Any ideas?

Thanks.
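For anyone reproducing this, a minimal sketch of a reachability check may help narrow things down. It assumes Python is available wherever the check runs, and it only tests whether a TCP connection opens within a timeout. It says nothing about TLS or API authentication. The function name `can_connect` is made up for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


# Example call, using the endpoint from the log above:
#   can_connect("10.0.0.1", 443)
```

If the call returns False from the master node but True from the agent, that would be consistent with the timeout in the log above.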

sbulman commented 7 years ago

A bit more info. I created the ACS cluster with 1 agent. The fluentd pod that is crashing is on the master node. The pod running on the agent appears to be working fine.

ghost commented 6 years ago

We're facing the same issue, same symptoms and circumstances as @sbulman. The fluentd logger pod continually crashes on the master node on Azure ACS.

bacongobbler commented 6 years ago

There should not be a fluentd pod running on the master node. There was an open ticket about DaemonSet pods being accidentally scheduled on the Kubernetes master node, and it was eventually fixed upstream.

More background context in this ticket, which was resolved in Kubernetes 1.5.0+ via https://github.com/kubernetes/kubernetes/pull/35526.

ghost commented 6 years ago

OK, thanks @bacongobbler for the context. It still appears to be an issue on ACS today, though. Any thoughts would be much appreciated!

The fluentd logger pod event for the master node indicates the following error:

Error syncing pod, skipping: failed to "StartContainer" for "deis-logger-fluentd" with CrashLoopBackOff: "Back-off 10s restarting failed container=deis-logger-fluentd pod=deis-logger-fluentd-swjnl_deis

K8S versions (client and Azure Container Service):

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-14T06:55:55Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

Deis version 2.18.0

The fluentd pod is definitely running on the master node on ACS, as the event logs show; in this case the pod was created by k8s-master-47933ef9-0.

monaka commented 6 years ago

I hit the same issue on my K8s/CoreOS cluster. It's not on ACS, but it might have the same root cause.

In my case, it was fixed by adding the option --register-with-taints=node-role.kubernetes.io/master=true:NoSchedule to hyperkube.

The unschedulable field of a node is not respected by the DaemonSet controller.
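To illustrate why the taint helps where cordoning does not, here is a rough sketch in Python. It is a simplified model of the behavior described above, not the actual DaemonSet controller code. The function names and the node name `example-master` are made up for illustration; the node dictionaries only mimic the shape of `spec.unschedulable` and `spec.taints`.

```python
# Simplified model of DaemonSet placement; NOT the real controller logic.

def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def daemonset_would_place(node, tolerations=()):
    """Decide whether a DaemonSet pod would land on `node` in this model."""
    # spec.unschedulable is deliberately never consulted here.
    for taint in node.get("spec", {}).get("taints", []):
        if taint["effect"] not in ("NoSchedule", "NoExecute"):
            continue  # PreferNoSchedule is only a soft preference
        if not any(tolerates(t, taint) for t in tolerations):
            return False
    return True


# Hypothetical nodes: one merely cordoned, one carrying the master taint.
cordoned_master = {
    "metadata": {"name": "example-master"},
    "spec": {"unschedulable": True},
}
tainted_master = {
    "metadata": {"name": "example-master"},
    "spec": {"taints": [{"key": "node-role.kubernetes.io/master",
                         "value": "true", "effect": "NoSchedule"}]},
}

print(daemonset_would_place(cordoned_master))  # True: cordoning alone is ignored
print(daemonset_would_place(tainted_master))   # False: the taint is not tolerated
```

In this model, a DaemonSet whose pod spec carries a matching toleration would still be placed on the tainted node, so the taint only keeps away pods that do not tolerate it.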

Cryptophobia commented 6 years ago

This issue was moved to teamhephy/workflow#6