Closed serathius closed 2 months ago
@ArkaSaha30
Do we have access to arm nodes in the Prow infra? The last I remember is that we were waiting for them. I don't see any updates regarding this on https://github.com/kubernetes/k8s.io/issues/6102. So, it may be a blocker for the second point.
Not great, but I will not block the migration regardless. Robustness tests only bring value if there is someone willing to review them. With Prow being much better, no-one will be willing to review arm robustness failures.
I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub actions.
I don't see much activity in https://github.com/kubernetes/k8s.io/issues/6102. Who or where would be a good place to ask for a status update/ETA for ARM nodegroups?
Hi @upodroid - We spoke at KubeCon EU Paris about a dedicated arm64 cluster for prow. Can you please provide an update on the timeline for it being available?
I can see two options: pause running robustness for the ARM architecture (not ideal) or keep ARM tests running on GitHub actions.
I was thinking about the second option; however, due to the sub-par user experience, I expect it would be equivalent to the first one.
Discussed on Slack with Arka, we'll be working on the following at the moment:
/assign @ArkaSaha30 @ivanvc
@ivanvc: GitHub didn't allow me to assign the following users: ArkaSaha30.
Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
/assign
Currently, the robustness tests on GitHub Actions run only on main or PRs to main. Do we need to run them on release-3.5 and release-3.4?
The existing robustness periodic and presubmit can be configured to handle all the 3 branches.
There are no robustness tests on branches other than main. We develop and run robustness tests from the main branch and validate binaries built from older branches.
We have finished the first and the third tasks. When would you think is a good time to remove the GitHub action @serathius?
We can't move forward with the second, as we don't have a timeline on when ARM runners are going to be available.
We have finished the first and the third tasks. When would you think is a good time to remove the GitHub action @serathius?
We can keep arm64 on GitHub Actions so that we don't block on it.
@ArkaSaha30, can you help with "Remove non-arm robustness tests from GitHub Actions"?
Thanks.
Update - arm64 runners were enabled in prow (refer to k8s-infra Slack discussions: 1, 2)
arm64 robustness jobs.
arm64 robustness GitHub actions workflows.
ci-etcd-robustness-arm64 looks broken.
Looking at most recent full run it says:
Test started today at 5:36 PM failed after 1h19m14s.
Job logs show:
{"Time":"2024-08-08T06:47:33.907178941Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory/EtcdHighTraffic/ClusterOfSize1","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (79484): Git SH{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-08T06:47:36Z"}
++ early_exit_handler
++ '[' -n 17 ']'
++ kill -TERM 17
++ cleanup_dind
++ [[ false == \t\r\u\e ]]
+ EXIT_VALUE=143
Looks like job was interrupted? Or is that expected / unrelated output?
Job history shows as aborted: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64
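For readers unfamiliar with the `EXIT_VALUE=143` in the log above: by shell convention, a process killed by a signal is reported with status 128 plus the signal number, and SIGTERM is signal 15. The following is a minimal sketch that reproduces that status locally. The `sleep 30` child is only a stand-in for the test process (an assumption for illustration); it is not how Prow runs the job.

```shell
#!/bin/sh
# Start a long-running child as a stand-in for the test process.
sleep 30 &
child=$!

# Send the same signal the Prow wrapper sends in the log (kill -TERM).
kill -TERM "$child"

# wait returns the child's status; capture it without aborting the script.
status=0
wait "$child" || status=$?

echo "exit status: $status"                # 128 + 15 (SIGTERM) = 143
echo "signal name: $(kill -l "$status")"   # maps the status back to TERM
```

A status of 143 therefore says only that something sent SIGTERM to the process; it does not say who sent it or why.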
Edit: Interestingly, ci-etcd-robustness-main-arm64 was fine: https://testgrid.k8s.io/sig-etcd-robustness#ci-etcd-robustness-main-arm64. I am not too sure about the difference between those two jobs.
@jmhbnz, @serathius, are we ready to remove optional: true from the robustness presubmit jobs and mark this issue as complete?
I believe we can remove optional: true from the presubmits; the job seems to be behaving about the same as, if not better than, the amd64 equivalent presubmit.
I don't think we can close this yet though, we still have a problem with ci-etcd-robustness-arm64. Perhaps the team at the next robustness meeting could take a look at that, as I am out of my area of expertise trying to debug it.
Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.
Edit: Defer to @serathius as tech lead for robustness for final decision on optional: true.
Think we are ok to make presubmit job blocking.
I don't think we can close this yet though, we still have a problem with ci-etcd-robustness-arm64. Perhaps the team at the next robustness meeting could take a look at that, as I am out of my area of expertise trying to debug it.
My high-level question: why do we have separate ci-etcd-robustness-amd64 and ci-etcd-robustness-main-amd64 jobs (mirrored for arm)?
I don't think we can close this yet though, we still have a problem with ci-etcd-robustness-arm64. Perhaps the team at the next robustness meeting could take a look at that, as I am out of my area of expertise trying to debug it.
My bad, I thought it was addressed in etcd-io/etcd#17593. I see it's a different issue.
It looks like they are consistently aborted at around 80 minutes. Following early_exit_handler, it seems like the process is being interrupted by its parent, which sounds consistent with the output from the logs:
{"Time":"2024-08-16T22:50:16.205037989Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/robustness","Test":"TestRobustnessExploratory","Output":"/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd_--version) (80167): Go OS{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:173","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2024-08-16T22:50:17Z"}
I wonder if the ARM node or pods inside the node get rotated after 80m.
My high-level question: why do we have separate ci-etcd-robustness-amd64 and ci-etcd-robustness-main-amd64 jobs (mirrored for arm)?
I'm unsure about this one. Should we only have ci-etcd-robustness-amd64?
Just giving an update that I have a thread in #sig-k8s-infra. It looks like the bug is in the infra, not the job itself.
Link to kubernetes/k8s.io#7241
The ARM issues are now solved. There are multiple green runs in prow (https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64).
@serathius, should we delete ci-etcd-robustness-main-arm64 and only keep ci-etcd-robustness-arm64?
I don't know the exact differences in the job definitions, but from those 4 jobs
we only need 2: one for amd64 and one for arm. As for the name, I think it would be better to follow the same convention as ci-etcd-robustness-release35-amd64 and use the branch name in the job name. So preferably we leave
The difference between the jobs is that ci-etcd-robustness-{amd64,arm64} enables gofail (make gofail-enable) and builds the project (make build), while ci-etcd-robustness-main-{amd64,arm64} doesn't.
ci-etcd-robustness-arm64: https://github.com/kubernetes/test-infra/blob/cb419f072809b7554602219dadee3b0433b5682d/config/jobs/etcd/etcd-periodics.yaml#L171-L183
result=0
apt-get -o APT::Update::Error-Mode=any update && apt-get --yes install cmake libfuse3-dev libfuse3-3 fuse3
sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf
make install-lazyfs
set -euo pipefail
GO_TEST_FLAGS="-v --count 120 --timeout '200m' --run TestRobustnessExploratory"
make gofail-enable
make build
VERBOSE=1 GOOS=linux GOARCH=arm64 CPU=8 EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/data/results make test-robustness || result=$?
if [ -d /data/results ]; then
  zip -r ${ARTIFACTS}/results.zip /data/results
fi
exit $result
ci-etcd-robustness-main-arm64: https://github.com/kubernetes/test-infra/blob/cb419f072809b7554602219dadee3b0433b5682d/config/jobs/etcd/etcd-periodics.yaml#L263-L273
result=0
apt update && apt-get --yes install cmake libfuse3-dev libfuse3-3 fuse3
sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf
make install-lazyfs
set -euo pipefail
GO_TEST_FLAGS="-v --count 120 --timeout '200m' --run TestRobustnessExploratory"
VERBOSE=1 GOOS=linux GOARCH=arm64 CPU=8 EXPECT_DEBUG=true GO_TEST_FLAGS=${GO_TEST_FLAGS} RESULTS_DIR=/data/results make test-robustness-main || result=$?
if [ -d /data/results ]; then
  zip -r ${ARTIFACTS}/results.zip /data/results
fi
exit $result
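To make the comparison between the two job scripts above easier to see, here is a small sketch that diffs only their `make` steps. The step lists are copied from the two scripts; the temporary file names are hypothetical and exist only for this illustration. The sketch ignores the setup lines, which also differ slightly (`apt-get -o APT::Update::Error-Mode=any update` versus `apt update`).

```shell
#!/bin/sh
# Scratch directory for the two step lists (file names are illustrative only).
tmpdir=$(mktemp -d)

# make steps from the ci-etcd-robustness-arm64 script.
cat > "$tmpdir/robustness-arm64.steps" <<'EOF'
make install-lazyfs
make gofail-enable
make build
make test-robustness
EOF

# make steps from the ci-etcd-robustness-main-arm64 script.
cat > "$tmpdir/robustness-main-arm64.steps" <<'EOF'
make install-lazyfs
make test-robustness-main
EOF

# diff exits with status 1 when the inputs differ, so tolerate that here.
steps_diff=$(diff "$tmpdir/robustness-arm64.steps" "$tmpdir/robustness-main-arm64.steps" || true)
echo "$steps_diff"

rm -rf "$tmpdir"
```

Lines prefixed with `<` appear only in the first job, and lines prefixed with `>` only in the second, which highlights the explicit gofail and build steps and the different test target.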
Which one would we need to keep, the one with gofail enabled or the other?
The GitHub workflows we used to have didn't enable gofail, nor were we building the project. We should keep ci-etcd-robustness-main-{arm64,amd64}, which are already consistent with the job naming you suggested.
Good spotting @ivanvc. That seems reasonable to me, defer to @serathius for final decision.
The lack of building and enabling gofail is expected because of the difference between the targets: make test-robustness just runs tests (on a locally available binary), while make test-robustness-main tests etcd from the main branch (downloads, enables gofail and builds).
With the differences cleaned up I think we can leave ci-etcd-robustness-main-{arm64,amd64}.
I believe the only outstanding task from this issue is marking the pre-submit jobs as blocking. @serathius, do you think we should do this soon, or should we leave them running for a little longer?
I think we are good to mark them blocking. Robustness tests have been stable on both PRs and periodics.
I'll close this issue now since we don't have any outstanding tasks (please reopen if needed).
Thanks to everyone who contributed to migrating the robustness tests.
What would you like to be added?
After the last robustness team meeting it was clear how superior Prow + TestGrid is to GitHub Actions.
https://testgrid.k8s.io/sig-etcd-robustness#Summary vs https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml
Advantages:
TODO:
cc @jmhbnz @ivanvc
Why is this needed?
Migration to Prow opens a new chapter for the stability and debuggability of robustness tests, with the goal of making the process more approachable for new contributors.