cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

Eirini brain tests are failing #1371

Closed satadruroy closed 4 years ago

satadruroy commented 4 years ago

KubeCF 2.5/Eirini/Ingress Controller EKS/k8s 1.17

Tests complete: 7 Passed, 2 Skipped, 5 Failed
  Skipped tests:
    017_syslog_forwarding_test.rb
    011_nfspersi_test.rb
  Failed tests:
    010_insecure_registry_test.rb with exit code 1
    004_tcprouting_test.rb with exit code 1
    005_metron_test.rb with exit code 1
    018_autoscaler_test.rb with exit code 1
    013_credhub_test.rb with exit code 1

The autoscaler failures are intermittent, but the insecure_registry, tcp_routing and metron test failures were also observed on AKS with Eirini.

andreas-kupries commented 4 years ago

@satadruroy Do you have the full test output somewhere? Please attach it to the ticket.

The excerpt in the description is missing essentially all the details needed for proper debugging. (IIRC, when a test fails the brain runner dumps the entire log for that test into the final output, to enable post-mortem analysis.)

andreas-kupries commented 4 years ago

EBRAIN.txt

Brain logs from the referenced build. Not the full logs from Concourse, just the part dealing with the brain tests. It has the (expected) 004, 005, and 010 failing (tcp-routing, metron, insecure-registry).

andreas-kupries commented 4 years ago

Wrt 004/tcp_routing, the main error reported is:

+ curl --fail -s -o /dev/null tcp-route-node-env-7516.ci-aks-9fec0ee3133b908c.susecap.net
+ curl tcp.ci-aks-9fec0ee3133b908c.susecap.net:20005
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to tcp.ci-aks-9fec0ee3133b908c.susecap.net port 20005: Connection refused
Command exited with 7

Does this point towards TCP routing being set up incorrectly, or not at all, for Eirini?
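As an aid for reading the curl result above, here is a minimal sketch that maps a few of curl's documented exit codes to a short description. The function name explain_curl_exit is hypothetical and not part of the brain tests. Exit code 7 means the connection attempt itself failed, whereas a DNS problem would have produced 6. So the host name resolved, and the "Connection refused" suggests that nothing was accepting connections on port 20005.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the brain tests): translate a curl exit
# code into a short description. The codes are taken from the curl manual.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect to host" ;;
    22) echo "HTTP error (status >= 400) with --fail" ;;
    28) echo "operation timed out" ;;
    *)  echo "other curl error ($1)" ;;
  esac
}
```

Usage, assuming the same endpoint as in the log above: run the curl command, then call explain_curl_exit "$?" directly afterwards.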

andreas-kupries commented 4 years ago

005/metron - The main part of the test pushes an application, and then inspects the output of cf logs --recent for that app, looking for a keyword.

+ cf logs node-env-9c84 --recent | grep -i Downloading

The push appears to work, but the keyword is not found. It might be that under Eirini the logs differ enough that they lack the expected keyword and contain something else instead.

andreas-kupries commented 4 years ago

That should be relatively easy to check outside of brain tests.

andreas-kupries commented 4 years ago

nodeenv-recent-eirini.log

Indeed, there is no "Downloading" to be found in this log. There is a "download", but it is unclear whether that is similar to the sought line. The app looks properly staged and is marked as running; nothing bogus in the log.

Next, try a Diego cluster to compare the logs. (I strongly suspect a Diego/Eirini difference here.)

andreas-kupries commented 4 years ago

In Diego, the primary sources of Downloading in the logs are all the buildpacks, and later Downloading app package. None of that is really visible in Eirini.

nodeenv-recent-diego.log

An early entry in both logs is Creating build for app with guid .... I would recommend using that as the keyword instead.
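A minimal sketch of what the changed check could look like, assuming the test keeps its current shape of piping cf logs --recent into grep. The function name has_staging_marker is hypothetical; the actual change in the brain tests may look different.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the log text on stdin contains the entry
# that shows up in both the Diego and the Eirini logs, instead of the
# Diego-only "Downloading" keyword.
has_staging_marker() {
  grep -qi 'Creating build for app with guid'
}
```

Usage, with the app name from the failing run: cf logs node-env-9c84 --recent | has_staging_marker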

andreas-kupries commented 4 years ago

Now looking at 010/insecure-registry: this looks to be the same issue as for 004/tcp-routing, i.e.

curl: (7) Failed to connect to tcp.ci-aks-9fec0ee3133b908c.susecap.net port 20005: Connection refused
Command exited with 7

@satadruroy @viovanov: Do we have proper support for TCP routing in Eirini?

mudler commented 4 years ago

  Skipped tests:
    018_autoscaler_test.rb
    011_nfspersi_test.rb
    017_syslog_forwarding_test.rb

  Failed tests:
    005_metron_test.rb with exit code 1
    004_tcprouting_test.rb with exit code 1
    010_insecure_registry_test.rb with exit code 1
    007_buildpacks_test.rb with exit code 1

Maybe https://github.com/cloudfoundry-incubator/kubecf/pull/1398 will help with credhub tests (I've deployed from https://github.com/cloudfoundry-incubator/kubecf/tree/edg/persi-brains )

andreas-kupries commented 4 years ago

PR for this ticket started, see SUSE/brain-tests-release/pull/20. For now it covers test 005 only.

andreas-kupries commented 4 years ago

Information wrt tcp routing, from the :rocket: ...

@f0rmiga writes: @gaktive @viovanov There's no way to do TCP routing with Eirini right now. The component responsible for emitting the TCP route to the routing-api is the route_emitter that comes with Diego. It's implemented in https://github.com/cloudfoundry/route-emitter/blob/2d1c1653c62944048c3cec1243f97d9bf6232c56/emitter/routing_api_emitter.go. Eirini does have a route emitter, but it doesn't implement the routing-api emitter: https://github.com/cloudfoundry-incubator/eirini/tree/0e9faaaa31778c6cb84828193a5af9b6b5e511d0/route. For a bit more context, the HTTP routes are emitted to gorouter directly, while the TCP routes are emitted to routing-api. This is why we can actually disable routing-api when TCP routing is not needed. Thinking even more, with Eirini not supporting TCP routing, we can disable routing-api and tcp-router completely. A longer-term solution would be to actually implement the routing-api emitter in Eirini. @jimmykarily Any ideas?

@troytop writes: @viovanov @f0rmiga @gaktive missing tcp routing in eirini is not a blocker for CAP 2.1

andreas-kupries commented 4 years ago

PR SUSE/brain-tests-release/pull/20 Merged. Watching for v0.0.15 build now.

jimmykarily commented 4 years ago

@andreas-kupries I had to do the same investigation for the CATs that were failing on tcp routing. The result was this story on PT: https://www.pivotaltracker.com/story/show/174033038 . As you see it's not in the backlog yet.

f0rmiga commented 4 years ago

A correction to my comment: we should not remove routing-api just because tcp-router doesn't work with it. It keeps track of the routes, not just the TCP ones. If the gorouter misses an update from NATS, it can sync with the routing-api to keep the correct state.

andreas-kupries commented 4 years ago

Right now my local changes disable only 004 and 010, i.e. the tcp-routing test and the insecure-registry test.

andreas-kupries commented 4 years ago

New PR: #1468 (Just the BTR bump).

andreas-kupries commented 4 years ago

New PR: #1469 (Changed handling (defaults) of routing_api.enabled).