cilium / tetragon

eBPF-based Security Observability and Runtime Enforcement
https://tetragon.io
Apache License 2.0

TestLabelsDemoApp failures (flake?) #1954

Open kkourt opened 10 months ago

kkourt commented 10 months ago

Hit a TestLabelsDemoApp failure (https://github.com/cilium/tetragon/actions/runs/7472506683/job/20334842561?pr=1948) in https://github.com/cilium/tetragon/pull/1948. Seems like a flake.

Details:

 I0110 09:10:55.961095   14366 dumpinfo.go:240] contacting metrics serveraddrhttp://localhost:2112/metrics
--- FAIL: TestLabelsDemoApp (30.00s)
    --- FAIL: TestLabelsDemoApp/Run_Event_Checks (10.01s)
        --- FAIL: TestLabelsDemoApp/Run_Event_Checks/Run_Event_Checks (10.01s)
            rpcchecker.go:171: 
                    Error Trace:    /home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/tests/e2e/checker/rpcchecker.go:171
                                                /home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/vendor/sigs.k8s.io/e2e-framework/pkg/env/env.go:422
                                                /home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/vendor/sigs.k8s.io/e2e-framework/pkg/env/env.go:453
                    Error:          Received unexpected error:
                                    failed to get events after 10 tries
                    Test:           TestLabelsDemoApp/Run_Event_Checks/Run_Event_Checks
                    Messages:       checks should pass
    --- FAIL: TestLabelsDemoApp/Run_Workload (30.00s)
        --- FAIL: TestLabelsDemoApp/Run_Workload/Wait_for_Checker (30.00s)
            rpcchecker.go:107: 
                    Error Trace:    /home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/tests/e2e/checker/rpcchecker.go:107
                                                /home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/vendor/sigs.k8s.io/e2e-framework/pkg/env/env.go:422
                                                /home/runner/work/tetragon/tetragon/go/src/github.com/cilium/tetragon/vendor/sigs.k8s.io/e2e-framework/pkg/env/env.go:453
                    Error:          failed to wait for checker labelsEventChecker to start after 30s
                    Test:           TestLabelsDemoApp/Run_Workload/Wait_for_Checker
I0110 09:10:55.961222   14366 dumpinfo.go:299] contacting gops agentaddr127.0.0.1:8118
FAIL
E0110 09:10:55.961342   14366 dumpinfo.go:303] "failed to dump heap profile" err="failed to dump heap profile: dial tcp 127.0.0.1:8118: connect: connection refused" addr="127.0.0.1:8118"
coverage: [no statements]
I0110 09:10:55.961425   14366 dumpinfo.go:48] "Dumping test data" dir="/tmp/tetragon.e2e.TestLabelsDemoApp.2402988556"
I0110 09:10:55.961435   14366 dumpinfo.go:233] No checker info to dump
E0110 09:10:55.961716   14366 dumpinfo.go:244] "failed to contact metrics server" err="Get \"http://localhost:2112/metrics\": dial tcp [::1]:2112: connect: connection refused" addr="http://localhost:2112/metrics"
E0110 09:10:56.101977   14366 dumpinfo.go:71] "Failed to extract previous tetragon logs" err="failed to run kubectl logs -c tetragon -n kube-system tetragon-4x65q --previous: exit status 1"
I0110 09:10:56.892134   14366 cluster.go:165] Deleting temporary kind cluster tetragon-ci-5c01
I0110 09:10:56.892181   14366 kind.go:149] Destroying kind cluster tetragon-ci-5c01
I0110 09:10:58.147558   14366 kind.go:159] Removing kubeconfig file /tmp/kind-cluser-tetragon-ci-5c01-kubecfg1496165204
I0110 09:10:58.147660   14366 portforward.go:142] "Test ended, stopping portforward" pod="tetragon-4x65q" namespace="kube-system" ports=["54321:54321","2112:2112","8118:8118"]
FAIL    github.com/cilium/tetragon/tests/e2e/tests/labels   118.314s
ok      github.com/cilium/tetragon/tests/e2e/tests/policyfilter 127.217s    coverage: [no statements]
ok      github.com/cilium/tetragon/tests/e2e/tests/skeleton 379.648s    coverage: [no statements]
FAIL
make: *** [Makefile:251: e2e-test] Error 1
Error: Process completed with exit code 2.

mtardy commented 10 months ago

A more typical timeout example. I think one solution would be to move away from these deployments, which are flaky "by nature": they somehow fail to deploy on time even in an environment with enough resources. We have talked in the past about moving away from them and maybe using https://github.com/GoogleCloudPlatform/microservices-demo, especially now that Tetragon is independent of Cilium for those tests.

willfindlay commented 10 months ago

Let's make a good first issue to do the migration to the microservices demo. I think it makes a lot of sense.

mtardy commented 10 months ago

> Let's make a good first issue to do the migration to the microservices demo. I think it makes a lot of sense.

See https://github.com/cilium/tetragon/issues/1976.

lambdanis commented 6 months ago

This might be fixed by #2345. Let's keep an eye on the Tetragon e2e tests for a couple of weeks; if they are stable, we can close the issue.

UPDATE: It seems the test is still flaky after switching to the otel-demo app. It failed in #2417: https://github.com/cilium/tetragon/actions/runs/8966724879/attempts/1

Trung-DV commented 6 months ago

Hi @lambdanis https://github.com/cilium/tetragon/actions/runs/8966724879/job/24623050943#step:6:9683

time="2024-05-06T09:26:21Z" level=info msg="PROCESS_EXEC:894 => FINAL MATCH "
time="2024-05-06T09:26:21Z" level=info msg="DONE!"
--- FAIL: TestLabelsDemoApp (241.38s)
    --- FAIL: TestLabelsDemoApp/Run_Workload (118.15s)
        --- FAIL: TestLabelsDemoApp/Run_Workload/Run_Workload (118.10s)
            labels_test.go:53: failed to install demo app. run with `-args -v=4` for more context from helm: exit status 1
            labels_test.go:53: failed to install demo app. run with `-args -v=4` for more context from helm: exit status 1
            labels_test.go:53: failed to install demo app. run with `-args -v=4` for more context from helm: exit status 1
            labels_test.go:60: failed to install demo app after 3 tries
FAIL

The event checks seem to have succeeded, but the demo app failed to install. Maybe this is another flaky test?

Btw, I'm wondering why we have to install and check labels in parallel instead of installing the demo app successfully and then running the label checker test? https://github.com/cilium/tetragon/blob/a3b867cb9e77fd1a305c89e4955c0a993e83d8cf/tests/e2e/tests/labels/labels_test.go#L97-L121

mtardy commented 6 months ago

> Btw, I'm wondering why we have to install and check labels in parallel instead of installing the demo app successfully and then running the label checker test?

I'm not sure, indeed. The only reason I can think of is that it can potentially speed up the tests, because technically the checker can finish before all the deployments are ready. If changing that makes debugging easier, maybe we should consider it. Do you remember anything about this, @willfindlay?