This flake is slightly similar to https://github.com/cloudfoundry/korifi/issues/1310 since the api shim returns a 503
on org creation. The main difference here is that this happens 100% of the time on async org creation.
Like #1310, there is no evidence of very high load and no indication of api shim restarts.
I think this might be the same root cause as #1310
I don't think that's what's really going on: we looked at the gcloud console and found no evidence of the shim being restarted or crashed (crash counts were zero). I think the DPanicOnBugs setting is somehow not switched on in our shim. As a matter of fact, I was able to observe the poorly formatted log message right after the dpanic. And we observed this flake after the dpanic fix, so we are after something else here...
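For context, a minimal sketch of the stock go.uber.org/zap behaviour that a setting like DPanicOnBugs is presumably wired to (this illustrates plain zap, not Korifi's actual configuration): with a development logger DPanic panics and surfaces the bug, with a production logger it only writes a log line.

```go
package main

import "go.uber.org/zap"

func main() {
	// Production logger: DPanic is written at DPANIC level and execution continues.
	prod, _ := zap.NewProduction()
	prod.DPanic("reached an 'impossible' state") // logged only

	// Development logger: DPanic actually panics, so the bug is impossible to miss
	// instead of showing up as a poorly formatted log message further down the line.
	dev, _ := zap.NewDevelopment()
	dev.DPanic("reached an 'impossible' state") // panics here
}
```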
We found these interesting entries in the envoy proxy logs:
" "-"
[2022-07-08T09:33:15.932Z] "GET /v3/droplets/670c162a-2617-45af-8841-8daa8105cca9 HTTP/2" 503 UF 0 87 2 - "34.140.146.235" "go-resty/2.7.0 (https://github.com/go-resty/resty)" "0814f7ea-bb18-4386-8f44-b3f2a8c4fa8c" "cf.pr-e2e.korifi.cf-app.com" "10.100.0.60:9000"
[2022-07-08T09:33:15.927Z] "POST /v3/spaces HTTP/2" 503 UF 129 87 8 - "34.140.146.235" "go-resty/2.7.0 (https://github.com/go-resty/resty)" "be35f2dc-6260-4a6b-81ae-1ec67eb044a4" "cf.pr-e2e.korifi.cf-app.com" "10.100.0.60:9000"
[2022-07-08T09:33:15.926Z] "POST /v3/spaces HTTP/2" 503 UF 129 87 12 - "34.140.146.235" "go-resty/2.7.0 (https://github.com/go-resty/resty)" "aae629aa-ac7f-4df7-8bbd-4b1a7e765c02" "cf.pr-e2e.korifi.cf-app.com" "10.100.0.60:9000"
[2022-07-08T09:33:15.924Z] "POST /v3/spaces HTTP/2" 503 UF 129 87 15 - "34.140.146.235" "go-resty/2.7.0 (https://github.com/go-resty/resty)" "c7104b65-585c-4123-bba2-129d066bb190" "cf.pr-e2e.korifi.cf-app.com" "10.100.0.60:9000"
The timestamps on these correlate nicely with the time of one of the most recent flakes: https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/209.2#L62c50f4e:165:169
There are three EOF stderr logs in this build at the same time (https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/209.2#L62c50f4e:74:76), which are probably related to the envoy 503s.
This means that envoy itself is failing these requests: the UF response flag stands for upstream connection failure, i.e. envoy could not get a usable connection to the api shim upstream at 10.100.0.60:9000. It's worth noting that these requests happen in parallel, so we might be hitting some limit on the server side. Unfortunately we haven't been able to correlate any of these failures with particularly high load on the node.
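For anyone who wants to poke at this, a rough reproduction sketch that fires the same kind of parallel POST /v3/spaces requests with go-resty (the org GUID and space names are placeholders, and a real run would also need an Authorization header like the E2Es set):

```go
package main

import (
	"fmt"
	"sync"

	"github.com/go-resty/resty/v2"
)

func main() {
	// Placeholder target; the E2Es reach cf.pr-e2e.korifi.cf-app.com through envoy.
	const apiURL = "https://cf.pr-e2e.korifi.cf-app.com/v3/spaces"

	client := resty.New()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// "org-guid" is a placeholder; a real run needs an existing org
			// and a bearer token on the request.
			resp, err := client.R().
				SetHeader("Content-Type", "application/json").
				SetBody(fmt.Sprintf(`{"name":"parallel-space-%d","relationships":{"organization":{"data":{"guid":"org-guid"}}}}`, n)).
				Post(apiURL)
			if err != nil {
				fmt.Printf("request %d: transport error: %v\n", n, err)
				return
			}
			// A 503 generated by envoy (UF flag) shows up here as a plain 503.
			fmt.Printf("request %d: HTTP %d\n", n, resp.StatusCode())
		}(i)
	}
	wg.Wait()
}
```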
We have added a task that cats the envoy logs on test failure, in order to find out whether these entries always show up when the flake occurs.
Closing due to inactivity. Let's reopen this issue or create a new one if we encounter the flake again.
We have been seeing frequent flakes in the list spaces E2E test, both locally and on CI:
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/365
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/370
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/345
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/343
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/329
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/224
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/167
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/161
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/176
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/181
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/167
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/144
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/139
https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-main/builds/21.1
We should fix them.