RedHatQE / firewatch

React to OpenShift CI test failures
Apache License 2.0

Failed job, incorrectly updated JIRA tickets with passed test results. #159

Open vi-patel opened 6 months ago

vi-patel commented 6 months ago

Failed job, incorrectly updated JIRA tickets with passed test results.

The following job failed at pod creation, so the set of tests never ran: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-knative-serverless-operator-main-ocp4.15-lp-interop-operator-e2e-interop-aws-ocp415/1759589706720350208/artifacts/operator-e2e-interop-aws-ocp415/firewatch-report-issues/build-log.txt

Prow correctly marks the job as failed. However, firewatch incorrectly reports this job as a success (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-knative-serverless-operator-main-ocp4.15-lp-interop-operator-e2e-interop-aws-ocp415/1759589706720350208/artifacts/operator-e2e-interop-aws-ocp415/firewatch-report-issues/build-log.txt), adding passing-job notifications to Jira tickets and job labels to other linked Jira tickets.

calebevans commented 6 months ago

After some investigation, I have found the issue. The pod for the failed operator-e2e step never came up, but the finished.json file was not updated with the failure until after firewatch had already run. This seems to be the way OpenShift CI or Prow operates. Unfortunately, I'm not sure firewatch can catch this sort of error in its current state, and because it is an edge case it is hard to test. The order of operations seems to be something like this:

  1. Container fails to come up
  2. Prow writes the finished.json file as a success in the operator-e2e step
  3. Prow starts the "post" steps, which contain firewatch-report-issues (the ref that executes firewatch)
  4. Prow updates the finished.json file to reflect a failure after firewatch has already run
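
To make the race in the steps above concrete, here is a minimal sketch of a check against a step's finished.json. It is not firewatch's actual code: the helper name `step_passed` and both sample documents are hypothetical, and it assumes the file carries Prow's usual `passed` and `result` fields.

```python
import json


def step_passed(finished_json_text: str) -> bool:
    """Return True if a finished.json document reports a passing step.

    Hypothetical helper for illustration; assumes the document has a
    boolean "passed" field and a string "result" field.
    """
    data = json.loads(finished_json_text)
    return bool(data.get("passed")) and data.get("result") == "SUCCESS"


# Hypothetical contents of the operator-e2e finished.json at two moments.
# Step 2: what is on disk when the "post" steps (and firewatch) run.
at_firewatch_run = '{"passed": true, "result": "SUCCESS"}'
# Step 4: what is on disk after Prow rewrites the file.
after_prow_update = '{"passed": false, "result": "FAILURE"}'

print(step_passed(at_firewatch_run))   # True: firewatch sees a success
print(step_passed(after_prow_update))  # False: the real outcome, too late
```

Any check of this shape gives the wrong answer in step 3, because the only input it has is a file that has not yet been corrected.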

firewatch executed and finished at 16:56:32 (screenshot)

operator-e2e files (finished.json) updated at 17:22:23 (screenshot)
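
As a quick check on how large the window is, the gap between the two timestamps above can be computed as follows. This sketch assumes both times are from the same day and the same time zone, which the screenshots do not state explicitly.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps quoted above; same day and time zone are assumed.
firewatch_finished = datetime.strptime("16:56:32", FMT)
files_updated = datetime.strptime("17:22:23", FMT)

gap = files_updated - firewatch_finished
print(gap)  # 0:25:51
```

Under that assumption the file was corrected almost 26 minutes after firewatch finished, so a short wait inside the job would not have been enough to see the failure.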

With the files updated, I can run firewatch and create the correct bug in stage: https://issues.stage.redhat.com/browse/LPTOCPCI-1145

My current thinking is that we can't resolve this without running firewatch as a service outside of OpenShift CI. I would appreciate any ideas for resolving this behavior.

vi-patel commented 6 months ago

Bug filed against DPTP, adding for reference: https://issues.redhat.com/browse/DPTP-3902