etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.5k stars 9.74k forks source link

Migrate the robustness tests to prow #18136

Closed serathius closed 2 weeks ago

serathius commented 3 months ago

What would you like to be added?

After the last robustness team meeting it was clear how superior Prow + TestGrid is over GitHub actions.

https://testgrid.k8s.io/sig-etcd-robustness#Summary vs https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml

Advantages:

TODO:

cc @jmhbnz @ivanvc

Why is this needed?

Migration to Prow opens a new chapter for stability and debuggability of robustness test with the goal of making the process more approachable for new contributors.

henrybear327 commented 3 months ago

@ArkaSaha30

ivanvc commented 3 months ago

Do we have access to arm nodes in the Prow infra? The last I remember is that we were waiting for them. I don't see any updates regarding this on https://github.com/kubernetes/k8s.io/issues/6102. So, it may be a blocker for the second point.

serathius commented 3 months ago

Not great, but I will not block the migration regardless. Robustness tests only bring value if there is someone willing to review them. With Prow being much better, no-one will be willing to review arm robustness failures.

ivanvc commented 3 months ago

I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub actions.

I don't see much activity in https://github.com/kubernetes/k8s.io/issues/6102. Who or where would be a good place to ask for a status update/ETA for ARM nodegroups?

jmhbnz commented 3 months ago

Hi @upodroid - We spoke at KubeCon EU Paris about a dedicated arm64 cluster for prow. Can you please provide an update on the timeline for it being available?

serathius commented 3 months ago

I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub actions.

I was thinking about the second option, however due to sub-par user experience I expect it would be equal the first one.

ivanvc commented 3 months ago

Discussed on Slack with Arka, we'll be working on the following at the moment:

/assign @ArkaSaha30 @ivanvc

k8s-ci-robot commented 3 months ago

@ivanvc: GitHub didn't allow me to assign the following users: ArkaSaha30.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/etcd-io/etcd/issues/18136#issuecomment-2155285920): >Discussed on Slack with Arka, we'll be working on the following at the moment: >- Migrate robustness documentation to Prow https://github.com/etcd-io/etcd/tree/main/tests/robustness >- Add robustness tests presubmits > >/assign @ArkaSaha30 @ivanvc Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
ArkaSaha30 commented 3 months ago

/assign

ArkaSaha30 commented 3 months ago

Currently, the robustness tests on Github Actions run only on main or PRs to main. Do we need to run it on release-3.5 and release-3.4?
The existing robustness periodic and presubmit can be configured to handle all the 3 branches.

serathius commented 3 months ago

There are no robustness test on other branches beside main. We develop and run robustness test from main branch and validate binaries build from older branches.

ivanvc commented 3 months ago

We have finished the first and the third tasks. When would you think is a good time to remove the GitHub action @serathius?

We can't move forward with the second, as we don't have a timeline on when ARM runners are going to be available.

serathius commented 3 months ago

We have finished the first and the third tasks. When would you think is a good time to remove the GitHub action @serathius?

We can keep arm64 on Github actions to not block on it.

ivanvc commented 3 months ago

@ArkaSaha30, can you help with

Remove non-arm robustness tests from github actions.?

Thanks.

jmhbnz commented 1 month ago

Update - arm64 runners were enabled in prow, (refer k8s-infra slack discussions: 1, 2)

serathius commented 1 month ago

ci-etcd-robustness-arm64 looks broken. image

jmhbnz commented 1 month ago

ci-etcd-robustness-arm64 looks broken.

Looking at most recent full run it says:

Test started today at 5:36 PM failed after 1h19m14s.

Job logs show:

 {"Time":"2024-08-08T06:47:33.907178941Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory/EtcdHighTraffic/ClusterOfSize1","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (79484): Git SH{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-08T06:47:36Z"}
++ early_exit_handler
++ '[' -n 17 ']'
++ kill -TERM 17
++ cleanup_dind
++ [[ false == \t\r\u\e ]]
+ EXIT_VALUE=143 

Looks like job was interrupted? Or is that expected / unrelated output?

Job config is here.

Job history shows as aborted: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64

Edit: Interestingly ci-etcd-robustness-main-arm64 was fine https://testgrid.k8s.io/sig-etcd-robustness#ci-etcd-robustness-main-arm64. I am not too sure on the difference between those two jobs.

ivanvc commented 1 month ago

@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?

jmhbnz commented 1 month ago

@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?

We can remove optional: true from the presubmits I believe, the job seems to be behaving about the same if not better than the amd64 equivalent presubmit.

I don't think we can close this yet though, we still have an problem with the ci-etcd-robustness-arm64. Perhaps team at next robustness meeting could take a look at that as I am out of my area of expertise trying to debug it.

Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.

serathius commented 1 month ago

Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.

Think we are ok to make presubmit job blocking.

I don't think we can close this yet though, we still have an problem with the ci-etcd-robustness-arm64. Perhaps team at next robustness meeting could take a look at that as I am out of my area of expertise trying to debug it.

My high level question, why do we have separated ci-etcd-robustness-amd64 and ci-etcd-robutstness-main-amd64 (mirrored for arm)?

ivanvc commented 1 month ago

I don't think we can close this yet though, we still have an problem with the ci-etcd-robustness-arm64. Perhaps team at next robustness meeting could take a look at that as I am out of my area of expertise trying to debug it.

My bad, I thought it was addressed in etcd-io/etcd#17593. I see it's a different issue.

It looks like they are consistently aborted at around 80 minutes. Following early_exit_handler, it seems like the process is being interrupted by its parent. Which sounds consistent with the output from the logs:

{"Time":"2024-08-16T22:50:16.205037989Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (80167): Go OS{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-16T22:50:17Z"}

I wonder if the ARM node or pods inside the node get rotated after 80m.

My high level question, why do we have separated ci-etcd-robustness-amd64 and ci-etcd-robutstness-main-amd64 (mirrored for arm)?

I'm unsure about this one. Should we only have ci-etcd-robustness-amd64?

ivanvc commented 1 month ago

Just giving an update that I have a thread in #sig-k8s-infra. It looks like the bug is in the infra, not the job itself.

ivanvc commented 1 month ago

Link to kubernetes/k8s.io#7241

ivanvc commented 3 weeks ago

The ARM issues are now solved. There are multiple green runs in prow (https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64).

@serathius, should we delete ci-etcd-robustness-main-arm64 and only keep ci-etcd-robustness-arm64?

serathius commented 3 weeks ago

Don't know the exact differences in the job definition but from those 4 jobs

We only need 2 one for amd64 one for arm. As for the name I think it would be better follow the same convention as ci-etcd-robustness-release35-amd64 and use the branch name in the job name. So preferably we leave

ivanvc commented 3 weeks ago

The difference between the jobs is that ci-etcd-robustness-{amd64,arm64} enables gofail make gofail-enable and builds the project (make build). While ci-etcd-robustness-main-{amd64,arm64}` doesn't.

Which one would we need to keep, the one with gofail enabled or the other?

ivanvc commented 3 weeks ago

The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.

jmhbnz commented 3 weeks ago

The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.

Good spotting @ivanvc. That seems reasonable to me, defer to @serathius for final decision.

serathius commented 3 weeks ago

Lack of building and enabling gofail is expected because the difference between targets make test-robustness which just runs tests (on locally available binary), make test-robustness-main tests etcd from the main branch (downloads, enables gofail and builds).

With the differences cleaned up I think we can leave ci-etcd-robustness-main-{arm64,amd64}.

ivanvc commented 3 weeks ago

I believe the only outstanding task from this issue is marking the pre-submit jobs as blocking. @serathius, do you think we should do this soon, or should we leave them running for a little longer?

serathius commented 2 weeks ago

I think we are good to mark them blocking. Robustness tests have been stable on both PRs and periodics.

image image

ivanvc commented 2 weeks ago

I'll close this issue now since we don't have any outstanding tasks (please reopen if needed).

Thanks to everyone who contributed to migrating the robustness tests.