cloud-bulldozer / benchmark-operator

The Chuck Norris of cloud benchmarks
Apache License 2.0

uperf client pods are not being created in pod2service network perf tests #798

Closed SachinNinganure closed 1 year ago

SachinNinganure commented 1 year ago

The pod2service network perf (uperf) test on ibm-cloud with ocp412 fails because the uperf-client pods are not being created.

test -->https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/network-perf/369/console

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/network-perf/368/

@qiliRedHat @paigerube14 @mffiedler @rsevilla87

paigerube14 commented 1 year ago

I created an ibm 4.13 cluster while testing some other things. Is the 'uperf-client-*' pod seen below what you were missing? I wonder why it wouldn't work for 4.12 but would for 4.13.

 % oc get pods -n benchmark-operator                            
NAME                                            READY   STATUS    RESTARTS   AGE
backpack-37a49c24-44t64                         1/1     Running   0          2m25s
backpack-37a49c24-9rwvp                         1/1     Running   0          2m25s
backpack-37a49c24-c2jtw                         1/1     Running   0          2m25s
backpack-37a49c24-np4h8                         1/1     Running   0          2m25s
backpack-37a49c24-qvr4f                         1/1     Running   0          2m25s
backpack-37a49c24-szbjd                         1/1     Running   0          2m25s
benchmark-controller-manager-7d694c6b9c-94qxg   2/2     Running   0          3m41s
uperf-client-172.30.78.189-37a49c24-hbhnm       1/1     Running   0          32s
uperf-server-0-37a49c24-z4zhr                   1/1     Running   0          63s

Looking at Kibana for one of the runs you listed, I do see a field that says "client_node:sn-npt-ibm412-fs4fv-worker-3-4lgbl". I'm not sure whether that's accurate or whether the data is actually there.

Run for reference: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/network-perf/370/console

qiliRedHat commented 1 year ago

@SachinNinganure Have you described the pod and collected its events to check what happens to it?

qiliRedHat commented 1 year ago

BTW: Is that an ibm-cloud specific issue?

SachinNinganure commented 1 year ago

@paigerube14 the test you executed also failed; the client pod did not start there either. It looks like you collected the pod list just after the test started: the client pods looked good at the start, but not after that.

11-17 02:47:57.645 ripsaw-cli:ripsaw.models.benchmark:ERROR :: Benchmark exception: The benchmark uperf-pod2svc-2 timed out
11-17 02:47:57.645 Wed Nov 16 21:17:57 UTC 2022 Benchmark failed, dumping workload more recent logs
11-17 02:47:57.944 NAME READY STATUS RESTARTS AGE
11-17 02:47:57.944 uperf-server-0-37a49c24-psdrt 1/1 Running 0 122m
11-17 02:47:57.944 uperf-server-1-37a49c24-9sprz 1/1 Running 0 122m
11-17 02:47:58.245 Wed Nov 16 21:17:58 UTC 2022 Writing pod logs in /tmp/tmp.3mXNBirbJq/uperf-server-0-37a49c24-psdrt.log
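A quick way to confirm from a captured pod listing that only server pods exist is to count uperf pods by role. This is a minimal sketch, assuming the listing is plain `oc get pods` style text (optionally prefixed with Jenkins timestamps, as above); the function name `uperf_roles` is hypothetical and not part of ripsaw or benchmark-operator.

```python
import re

def uperf_roles(listing: str) -> dict:
    """Count uperf pods by role (client/server) in `oc get pods` style text."""
    counts = {"client": 0, "server": 0}
    for line in listing.splitlines():
        # Require a READY column (e.g. "1/1") right after the pod name so that
        # file paths mentioning a pod name are not counted as pods.
        match = re.search(r"\buperf-(client|server)-\S+\s+\d+/\d+\s", line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Pod listing lines copied from the Jenkins console output above.
sample = """\
11-17 02:47:57.944 NAME READY STATUS RESTARTS AGE
11-17 02:47:57.944 uperf-server-0-37a49c24-psdrt 1/1 Running 0 122m
11-17 02:47:57.944 uperf-server-1-37a49c24-9sprz 1/1 Running 0 122m
"""

print(uperf_roles(sample))
```

A client count of zero next to a non-zero server count matches the symptom reported here: the servers are running, but no client pod exists when the benchmark times out.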

SachinNinganure commented 1 year ago

@qiliRedHat for now I am seeing this only on ibm-cloud ocp412; 410 and 411 looked good.

mffiedler commented 1 year ago

Investigating IBM cloud 4.12 today

SachinNinganure commented 1 year ago

Added a file with log info: NPT-ibm-412.odt

qiliRedHat commented 1 year ago

From your log file, the backpack pod had a problem that is worth digging into.

162m        Warning   Unhealthy           pod/backpack-fa5f3f56-xlwlp                          Readiness probe failed: ls: cannot access '/tmp/indexed': No such file or directory
122m        Normal    Killing             pod/backpack-fa5f3f56-xlwlp                          Stopping container backpack

In the failed Jenkins job log you provided, uperf-pod2svc-1 completed fully while uperf-pod2svc-2 timed out. These are two identical runs, given that Pairs defaults to 2, so I'm curious what differs between them.

SachinNinganure commented 1 year ago

I added the benchmark-controller-manager log files from two different tests: one from aws, where the pod2svc test succeeds, and one from ibm-412, where the client pods fail to start.

[ aws-412-npt-pass.log.gz ibm-412-npt-fail.log.gz ]

mffiedler commented 1 year ago

Try running with METADATA_COLLECTION=false

qiliRedHat commented 1 year ago

In aws-412-npt-pass.log.gz, I saw this entry for the uperf-client batch job:

{"level":"info","ts":1668759356.2989795,"logger":"proxy","msg":"Cache miss: batch/v1, Kind=Job, benchmark-operator/uperf-client-172.30.131.128-194d3529"}

In ibm-412-npt-fail.log.gz, I can't find any entry about uperf-client. So the pod is 'missing' because the batch job is not created.
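The same check can be scripted against a saved controller-manager log. The sketch below assumes the log is one JSON object per line with a `msg` field shaped like the "Cache miss" entries quoted in this thread; the helper names `cache_misses` and `has_client_job` are hypothetical. The failing sample uses a uperf-server Job entry and a ConfigMap entry from a failing run reported in this thread, since the contents of ibm-412-npt-fail.log.gz are not quoted here.

```python
import json
import re

# Matches messages such as:
#   Cache miss: batch/v1, Kind=Job, benchmark-operator/uperf-client-...
CACHE_MISS = re.compile(
    r"Cache miss: (?P<api>\S*), Kind=(?P<kind>\w+), (?P<ns>[^/]+)/(?P<name>\S+)"
)

def cache_misses(log_text):
    """Return (kind, name) pairs for every 'Cache miss' proxy message."""
    found = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed entries
        match = CACHE_MISS.match(entry.get("msg", ""))
        if match:
            found.append((match.group("kind"), match.group("name")))
    return found

def has_client_job(log_text):
    """True if the log mentions a batch Job whose name starts with uperf-client-."""
    return any(
        kind == "Job" and name.startswith("uperf-client-")
        for kind, name in cache_misses(log_text)
    )

pass_log = """
{"level":"info","ts":1668759356.2989795,"logger":"proxy","msg":"Cache miss: batch/v1, Kind=Job, benchmark-operator/uperf-client-172.30.131.128-194d3529"}
"""

fail_log = """
{"level":"info","ts":1679406976.8609834,"logger":"proxy","msg":"Cache miss: batch/v1, Kind=Job, benchmark-operator/uperf-server-0-0cc4bd2b"}
{"level":"info","ts":1679406982.1258848,"logger":"proxy","msg":"Cache miss: /v1, Kind=ConfigMap, benchmark-operator/uperf-test-0-0cc4bd2b"}
"""

print("pass log has client job:", has_client_job(pass_log))
print("fail log has client job:", has_client_job(fail_log))
```

This only shows whether the operator ever attempted to create the client Job; it does not explain why the Job was skipped.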

stale[bot] commented 1 year ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

rsevilla87 commented 1 year ago

This issue is still present, reopening.

SachinNinganure commented 1 year ago

METADATA_COLLECTION=false

I am running network tests (pod2svc) on an "AWS - OVN - Customer VPC - Hybrid OS" cluster with ocp413. I hit the same error: the uperf-client pods fail to start or get created.

[sninganu@sninganu ~]$ oc logs -f benchmark-controller-manager-86d495644c-x922k -c manager|less|grep "Cache miss"
{"level":"info","ts":1679406911.4110515,"logger":"proxy","msg":"Cache miss: apps/v1, Kind=DaemonSet, benchmark-operator/backpack-0cc4bd2b"}
{"level":"info","ts":1679406912.2812552,"logger":"proxy","msg":"Cache miss: apps/v1, Kind=DaemonSet, benchmark-operator/backpack-0cc4bd2b"}
{"level":"info","ts":1679406975.843491,"logger":"proxy","msg":"Cache miss: /v1, Kind=Service, benchmark-operator/uperf-service-0-0cc4bd2b"}
{"level":"info","ts":1679406976.8609834,"logger":"proxy","msg":"Cache miss: batch/v1, Kind=Job, benchmark-operator/uperf-server-0-0cc4bd2b"}
{"level":"info","ts":1679406982.1258848,"logger":"proxy","msg":"Cache miss: /v1, Kind=ConfigMap, benchmark-operator/uperf-test-0-0cc4bd2b"}
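To see where the sequence stops, the `ts` values can be turned into offsets from the first entry. This is a small sketch that assumes `ts` is a Unix epoch in seconds, as the values above suggest; the `timeline` helper is hypothetical.

```python
import json

def timeline(log_text):
    """Return (seconds_since_first_entry, msg) for each JSON log line."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ignore the shell prompt and any non-JSON lines
        entry = json.loads(line)
        entries.append((float(entry["ts"]), entry["msg"]))
    if not entries:
        return []
    start = entries[0][0]
    return [(round(ts - start, 1), msg) for ts, msg in entries]

# Log lines copied from the output above.
sample = """\
{"level":"info","ts":1679406911.4110515,"logger":"proxy","msg":"Cache miss: apps/v1, Kind=DaemonSet, benchmark-operator/backpack-0cc4bd2b"}
{"level":"info","ts":1679406912.2812552,"logger":"proxy","msg":"Cache miss: apps/v1, Kind=DaemonSet, benchmark-operator/backpack-0cc4bd2b"}
{"level":"info","ts":1679406975.843491,"logger":"proxy","msg":"Cache miss: /v1, Kind=Service, benchmark-operator/uperf-service-0-0cc4bd2b"}
{"level":"info","ts":1679406976.8609834,"logger":"proxy","msg":"Cache miss: batch/v1, Kind=Job, benchmark-operator/uperf-server-0-0cc4bd2b"}
{"level":"info","ts":1679406982.1258848,"logger":"proxy","msg":"Cache miss: /v1, Kind=ConfigMap, benchmark-operator/uperf-test-0-0cc4bd2b"}
"""

for offset, msg in timeline(sample):
    print(f"+{offset:6.1f}s  {msg}")
```

In this output the last entry is the uperf-test ConfigMap; no uperf-client Job entry follows it.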

Test Link --> https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/network-perf/651/parameters/

The result is the same when the test is run manually as well.

stale[bot] commented 1 year ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.