cloud-bulldozer / k8s-netperf

Running Networking Performance Tests against K8s
Apache License 2.0
28 stars 17 forks source link

[BUG] node2node (HostNetwork is true, service is false scenario) failing on ROSA cluster #110

Closed venkataanil closed 11 months ago

venkataanil commented 11 months ago

Describe the bug A clear and concise description of what the bug is. node2node (hostNetwork is true, service is false scenario) is failing in "ROSA" cluster with below error (k8s-netperf version is v0.1.15). This is not reproducible on self-managed AWS cluster.

time="2023-10-09 16:02:33" level=debug msg="server Running on ip-10-0-217-22.us-west-2.compute.internal with IP 10.0.217.22" time="2023-10-09 16:02:33" level=debug msg="client-across Running on ip-10-0-205-189.us-west-2.compute.internal with IP 10.0.205.189" time="2023-10-09 16:02:33" level=debug msg="Executing workloads. hostNetwork is true, service is false" time="2023-10-09 16:02:33" level=debug msg="🔥 Client (client-host-6548b5cbc4-sgt44,10.0.217.22) starting netperf against server : 10.0.205.189" time="2023-10-09 16:02:33" level=info msg="🗒️ Running netperf TCP_STREAM (service false) for 300s " time="2023-10-09 16:02:33" level=debug msg="[bash super-netperf 1 -H 10.0.205.189 -l 300 -t TCP_STREAM -- -k rt_latency,p99_latency,throughput,throughput_units,remote_recv_calls,local_send_calls,local_transport_retrans -m 64 -R 1]" time="2023-10-09 16:04:43" level=debug msg="(standard_in) 2: syntax error\r\n(standard_in) 2: syntax error\r\n(standard_in) 2: syntax error\r\n(standard_in) 2: syntax error\r\n(standard_in) 2: syntax error\r\nestablish control: are you sure there is a netserver listening on 10.0.205.189 at port 12865?\r\nRT_LATENCY=\r\nP99_LATENCY=\r\nTHROUGHPUT=\r\nLOCAL_TRANSPORT_RETRANS=\r\nREMOTE_RECV_CALLS=\r\nLOCAL_SEND_CALLS=" time="2023-10-09 16:04:43" level=debug msg="Executing workloads. hostNetwork is false, service is false"

To Reproduce Steps to reproduce the behavior:

  1. Run k8s-netperf on ROSA cluster with --all flag (i.e testing hostnetwork)
vishnuchalla commented 11 months ago

Similar failure in one of the PROW GCP runs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-gcp-4.13-nightly-x86-perfscale-gcp-data-path-6nodes-periodic/1711252433637740544/artifacts/perfscale-gcp-data-path-6nodes-periodic/openshift-qe-network-perf/build-log.txt

Also this error started appearing only after this PR: https://github.com/cloud-bulldozer/k8s-netperf/pull/109. Before this all the jobs were running successfully.

venkataanil commented 11 months ago

https://github.com/cloud-bulldozer/k8s-netperf/pull/109 corrected the behaviour of node2node. Prior to this PR, node2node test was actually executng node2pod scenario. After adding AWS security group which allows port 12865 on AWS workers, netperf client is able to reach the network server. We need to enhance our perf scripts to enable port 12865 on ROSA AWS workers before running netperf.