cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

'APP' and 'STG' logs fail to show up #1708

Open jbuns opened 3 years ago

jbuns commented 3 years ago

Describe the bug We’re currently facing issues with loggregator-bridge. When running cf logs, logs of type APP and STG fail to show up.

To Reproduce We've seen this fail in two different scenarios.

scenario 1: we’ve got a long-running deployment of kubecf with eirini and noticed that after a while, APP and STG logs stop appearing in cf logs. I’ve traced it down to loggregator-bridge. The pod logs look like:

{"level":"info","ts":1615161473.9439654,"caller":"kubeconfig/getter.go:53","msg":"Using in-cluster kube config"}
{"level":"info","ts":1615161473.9440942,"caller":"kubeconfig/checker.go:36","msg":"Checking kube config"}
Error:  unexpected EOF
Error:  unexpected EOF
Received non-pod object in watcher channel

scenario 2: after a fresh installation of kubecf+eirini on OpenShift 4.6 (k8s version 1.19), the cf logs fail to appear and the problem is the exact same as above.

Expected behavior When doing cf logs I should be able to also see APP and STG logs.

Environment KubeCF version: 2.7.12, Eirini version: 1.8, Kubernetes: 1.19

Additional context This was tested on OpenShift 4.4 and 4.6

jbuns commented 3 years ago

Tested also on AKS and seeing the same problem:

$ k logs loggregator-bridge-59f5cb64bc-9scbb -n kubecf
{"level":"info","ts":1615563276.4147344,"caller":"kubeconfig/getter.go:53","msg":"Using in-cluster kube config"}
{"level":"info","ts":1615563276.414798,"caller":"kubeconfig/checker.go:36","msg":"Checking kube config"}
Received non-pod object in watcher channel
Error:  unexpected EOF

jandubois commented 3 years ago

@mudler Any ideas what this might be / where to look next?

jbuns commented 3 years ago

I've turned on DEBUG logging for loggregator-bridge and this is the error I'm seeing:

Starting Loggregator
{"level":"info","ts":1615410279.0113866,"caller":"kubeconfig/getter.go:53","msg":"Using in-cluster kube config"}
{"level":"info","ts":1615410279.0114636,"caller":"kubeconfig/checker.go:36","msg":"Checking kube config"}
Received event:  {ERROR &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:too old resource version: 43522014 (43524698),Reason:Expired,Details:nil,Code:410,}}
Received non-pod object in watcher channel

In the code, I can see that the failure is happening here: https://github.com/cloudfoundry-incubator/eirini-loggregator-bridge/blob/master/podwatcher/podwatcher.go#L293-L306
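
To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual podwatcher code) of a handler that tells pod events apart from the Status object seen in the debug output above. The Pod, Status, and Event types and the classify helper are hypothetical stand-ins that only loosely mirror the shape of a Kubernetes watch event; none of them are taken from eirini-loggregator-bridge.

```go
package main

import "fmt"

// Pod, Status and Event are hypothetical stand-ins for the Kubernetes
// types; the real ones live in client-go/apimachinery and are not used here.
type Pod struct{ Name string }

type Status struct {
	Reason  string
	Message string
	Code    int
}

type Event struct {
	Type   string      // e.g. "ADDED", "MODIFIED", "DELETED", "ERROR"
	Object interface{} // *Pod for pod events, *Status for error events
}

// classify reports what a watcher could do with an event instead of
// treating every non-pod object as an unrecoverable surprise.
func classify(ev Event) string {
	switch obj := ev.Object.(type) {
	case *Pod:
		return "pod:" + obj.Name
	case *Status:
		if obj.Code == 410 {
			// Expired resource version: the watch has to be restarted.
			return "restart-watch"
		}
		return "error:" + obj.Message
	default:
		return "unknown"
	}
}

func main() {
	expired := Event{Type: "ERROR", Object: &Status{
		Reason:  "Expired",
		Message: "too old resource version: 43522014 (43524698)",
		Code:    410,
	}}
	fmt.Println(classify(expired))
	fmt.Println(classify(Event{Type: "ADDED", Object: &Pod{Name: "app-0"}}))
}
```

The point of the sketch is only that a Status with code 410 is a distinct, recognisable case rather than just a "non-pod object".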

@mudler / @jandubois any suggestions on how we can try to fix this?

jandubois commented 3 years ago

@jbuns Sorry, I know nothing about the eirini-loggregator-bridge, and have no time to learn about it.

Let's see if @mudler can give you hints next week; this week has been Hackweek at SUSE, so everyone has been working on other stuff... (FWIW, I spent half a day of my hackweek time yesterday on getting Eirini-1.8 to continue to work with the latest cf-deployment, so we don't have to drop it (yet) for the kubecf-2.8 releases).

mudler commented 3 years ago

It looks like we are receiving old events in the channel - this reminds me of the work done in EiriniX https://github.com/cloudfoundry-incubator/eirinix/pull/38 - is the loggregator-bridge using the latest EiriniX, including that fix? Otherwise, the alternative is manually specifying a ResourceVersion to start the watch on.

From the error message, it looks like the watcher is starting to listen for events which are old and no longer available, while the above PR was meant to fetch the latest ResourceVersion at startup to fix exactly that issue.

jbuns commented 3 years ago

@mudler loggregator-bridge is using eirinix v0.3.1 https://github.com/cloudfoundry-incubator/eirini-loggregator-bridge/blob/master/go.mod#L4

so I'm assuming that it has the fix you mentioned, as https://github.com/cloudfoundry-incubator/eirinix/pull/38 has been merged since v0.2.0: https://github.com/cloudfoundry-incubator/eirinix/compare/v0.2.0...master

Does that mean that the manager in eirinix is the one that's failing? Only difference I can see between the PR above and what's in the code now is this line: https://github.com/cloudfoundry-incubator/eirinix/blob/master/manager.go#L298

jbuns commented 3 years ago

The status Message:too old resource version seems to be expected behaviour according to Kubernetes: https://github.com/kubernetes/kubernetes/issues/22024

It looks like podwatcher needs to be updated in order to handle this, rather than erroring out.
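
One possible shape for that handling, as a sketch under stated assumptions rather than a proposed patch: when the stream delivers a Status with code 410, stop reading, open a fresh watch, and carry on. All names here (Pod, Status, Event, source, scripted, watchPods) are hypothetical, and the "source" function stands in for whatever lists pods and opens a new watch in the real code.

```go
package main

import "fmt"

// Hypothetical stand-ins for the Kubernetes pod, status and watch event types.
type Pod struct{ Name string }
type Status struct{ Code int }
type Event struct{ Object interface{} }

// source opens a fresh event stream, e.g. "list, then watch from that version".
type source func() <-chan Event

// scripted returns a source that replays the given streams, one per call.
// It exists only to simulate a server for this example.
func scripted(streams ...[]Event) source {
	i := 0
	return func() <-chan Event {
		var evs []Event
		if i < len(streams) {
			evs = streams[i]
			i++
		}
		ch := make(chan Event, len(evs))
		for _, ev := range evs {
			ch <- ev
		}
		close(ch)
		return ch
	}
}

// watchPods collects pod names and re-opens the stream when it sees a
// 410 status, giving up after maxRestarts restarts.
func watchPods(open source, maxRestarts int) (pods []string, restarts int) {
	for {
		expired := false
		for ev := range open() {
			switch obj := ev.Object.(type) {
			case *Pod:
				pods = append(pods, obj.Name)
			case *Status:
				expired = obj.Code == 410
			}
			if expired {
				break // leave this stream; a new one is opened below
			}
		}
		if !expired || restarts == maxRestarts {
			return pods, restarts
		}
		restarts++
	}
}

func main() {
	open := scripted(
		[]Event{{Object: &Pod{Name: "app-0"}}, {Object: &Status{Code: 410}}},
		[]Event{{Object: &Pod{Name: "app-1"}}},
	)
	pods, restarts := watchPods(open, 3)
	fmt.Println(pods, restarts)
}
```

The restart cap is a design choice for the sketch, so that a server which keeps returning 410 cannot spin the loop forever; a real fix would probably also want some backoff between restarts.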

@mudler any preference on how I should fix this or should I just come up with the fix and it can be reviewed in a PR?