I have asked @mukrishn to check this out, but looking at the log, it would seem the issue is that you didn't provide an Elasticsearch URL. If my assumption is correct, this would be a bug, as we shouldn't require an ES URL to be passed.
To verify this, you could use our development ES instance to try this. https://search-perfscale-dev-chmf5l4sh66lvxbnadi4bznl3a.us-west-2.es.amazonaws.com:443
@jtaleric That is not a requirement for running this benchmark, by the way; maybe it's required in Uperf, but I'm not sure about it.
@jtaleric I have tried running against your ES instance, but I'm facing the same issue. The details are given below:
apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: testpmd-benchmark
  namespace: my-ripsaw
spec:
  elasticsearch:
    url: "https://search-perfscale-dev-chmf5l4sh66lvxbnadi4bznl3a.us-west-2.es.amazonaws.com:443"
  workload:
    name: testpmd
    args:
      privileged: true
      pin: true
      pin_testpmd: "worker1"
      pin_trex: "worker1"
      networks:
        testpmd:
          - name: testpmd-sriov-network
            count: 2 # Interface count, Min 2
        trex:
          - name: testpmd-sriov-network
            count: 2
Both pods (testpmd and trex) go into an error state. The logs are attached below:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <invalid> default-scheduler Successfully assigned my-ripsaw/testpmd-application-pod-ae0f8c98-28wnk to r192bc2.oss.labs
Normal AddedInterface <invalid> multus Add eth0 [10.131.1.111/23]
Normal AddedInterface <invalid> multus Add net1 [10.57.1.123/24] from my-ripsaw/testpmd-sriov-network
Normal AddedInterface <invalid> multus Add net2 [10.57.1.124/24] from my-ripsaw/testpmd-sriov-network
Normal Pulled <invalid> kubelet Successfully pulled image "registry.redhat.io/openshift4/dpdk-base-rhel8:v4.6" in 1.099241488s
Normal Pulling <invalid> (x2 over <invalid>) kubelet Pulling image "registry.redhat.io/openshift4/dpdk-base-rhel8:v4.6"
Normal Created <invalid> (x2 over <invalid>) kubelet Created container testpmd
Normal Started <invalid> (x2 over <invalid>) kubelet Started container testpmd
Normal Pulled <invalid> kubelet Successfully pulled image "registry.redhat.io/openshift4/dpdk-base-rhel8:v4.6" in 1.050182689s
Warning BackOff <invalid> (x2 over <invalid>) kubelet Back-off restarting failed container
$ oc logs -f trex-traffic-gen-pod-ae0f8c98-7k8xm
2021-08-12T12:42:31Z - INFO - MainProcess - run_snafu: logging level is INFO
2021-08-12T12:42:31Z - INFO - MainProcess - _load_benchmarks: Successfully imported 1 benchmark modules: uperf
2021-08-12T12:42:31Z - INFO - MainProcess - _load_benchmarks: Failed to import 0 benchmark modules:
2021-08-12T12:42:31Z - INFO - MainProcess - run_snafu: Using elasticsearch server with host: https://search-perfscale-dev-chmf5l4sh66lvxbnadi4bznl3a.us-west-2.es.amazonaws.com:443
2021-08-12T12:42:31Z - INFO - MainProcess - run_snafu: Using index prefix for ES: ripsaw-testpmd
2021-08-12T12:42:31Z - INFO - MainProcess - run_snafu: Turning off TLS certificate verification
2021-08-12T12:42:31Z - INFO - MainProcess - run_snafu: Connected to the elasticsearch cluster with info as follows:
2021-08-12T12:42:32Z - INFO - MainProcess - run_snafu: {
    "name": "510fddd9ea3242aefad127567cffc68e",
    "cluster_name": "415909267177:perfscale-dev",
    "cluster_uuid": "Xz2IU4etSieAeaO2j-QCUw",
    "version": {
        "number": "7.10.2",
        "build_flavor": "oss",
        "build_type": "tar",
        "build_hash": "unknown",
        "build_date": "2021-05-21T20:25:46.519671Z",
        "build_snapshot": false,
        "lucene_version": "8.7.0",
        "minimum_wire_compatibility_version": "6.8.0",
        "minimum_index_compatibility_version": "6.0.0-beta1"
    },
    "tagline": "You Know, for Search"
}
2021-08-12T12:42:32Z - INFO - MainProcess - py_es_bulk: Using streaming bulk indexer
2021-08-12T12:42:32Z - INFO - MainProcess - wrapper_factory: identified trex as the benchmark wrapper
2021-08-12T12:42:32Z - INFO - MainProcess - trigger_trex: Starting TRex Traffic Generator..
Traceback (most recent call last):
  File "/usr/local/bin/run_snafu", line 33, in <module>
    sys.exit(load_entry_point('snafu', 'console_scripts', 'run_snafu')())
  File "/opt/snafu/snafu/run_snafu.py", line 139, in main
    es, process_generator(index_args, parser), parallel_setting
  File "/opt/snafu/snafu/utils/py_es_bulk.py", line 171, in streaming_bulk
    for ok, resp_payload in streaming_bulk_generator:
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 320, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 155, in _chunk_actions
    for action, data in actions:
  File "/opt/snafu/snafu/utils/py_es_bulk.py", line 117, in actions_tracking_closure
    for cl_action in cl_actions:
  File "/opt/snafu/snafu/run_snafu.py", line 194, in process_generator
    for action, index in data_object.emit_actions():
  File "/opt/snafu/snafu/trex_wrapper/trigger_trex.py", line 68, in emit_actions
    documents = self._json_payload(stdout)
  File "/opt/snafu/snafu/trex_wrapper/trigger_trex.py", line 33, in _json_payload
    payload = json.loads(data)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
@MuhammadMunir12 so you mentioned that your testpmd pod kept restarting and went into CrashLoopBackOff. That worries me a lot: if the testpmd pod isn't running, trex will definitely fail. Could you paste the error from the testpmd pod? oc logs testpmd-application-pod-ae0f8c98-28wnk
I am assuming the pods are pinned to different worker nodes, although the CR you shared pins both to the same one; just checking, because we have a pod anti-affinity check here.
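(For reference, the actual pod placement can be confirmed with something like the following, using the namespace from the CR above; the NODE column shows where each pod landed.)
$ oc get pods -n my-ripsaw -o wide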
@mukrishn I have pinned them to different worker nodes this time, but the issue is still the same. Logs for the testpmd pod are attached. I am using the default values for testpmd as mentioned in the operator's guide.
$ oc logs -f testpmd-application-pod-f28e6493-m4t2t
EAL: Detected 72 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Auto-detected process type: PRIMARY
EAL: Multi-process socket /tmp/dpdk/pg/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:d8:0a.2 on NUMA socket 1
EAL: probe driver: 8086:154c net_i40e_vf
EAL: PCI device 0000:d8:0a.3 on NUMA socket 1
EAL: probe driver: 8086:154c net_i40e_vf
testpmd: No probed ethernet devices
Fail: input rxq (1) can't be greater than max_rx_queues (0) of port 0
EAL: Error - exiting with code: 1
Cause: rxq 1 invalid - must be >= 0 && <= 0
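(For what it's worth, one way to see which kernel driver those VFs are actually bound to on the worker, assuming cluster-admin access for oc debug, is something like the following; the node name and PCI address are taken from the events and EAL log above.)
$ oc debug node/r192bc2.oss.labs -- chroot /host readlink /sys/bus/pci/devices/0000:d8:0a.2/driver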
@jtaleric That is not a requirement for running this benchmark, by the way; maybe it's required in Uperf, but I'm not sure about it.
It shouldn't be a requirement. However, some things do slip through the cracks. Either way, it would be good to catch this issue rather than drop a traceback.
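(A rough sketch of the kind of guard that could surface this as a readable error instead of a traceback; illustrative only, not the actual snafu wrapper code, and the function name is made up.)
import json
import logging

log = logging.getLogger("snafu")

def parse_trex_output(stdout):
    """Return the parsed trex JSON payload, or None after logging a readable error."""
    # Empty output usually means trex never produced stats (e.g. testpmd is down).
    if not stdout or not stdout.strip():
        log.error("trex produced no output; is the testpmd pod actually running?")
        return None
    try:
        return json.loads(stdout)
    except json.JSONDecodeError as err:
        log.error("trex output is not valid JSON (%s); raw output follows:\n%s", err, stdout)
        return None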
@MuhammadMunir12 testpmd itself failed and needs to be fixed here. From the log, you are using an interface on NUMA socket 1, and I believe the hugepages are created on the same socket as well, but you haven't provided the socket_memory for the testpmd app on the right socket. You can set it as part of the CR (by default the script assigns this to socket 0 only):
workload:
  args:
    socket_memory: 0,1024 # in socket order; the default would be 1024,0
More config params are here.
Also I would like to see your SRIOV policy and PAO profile, just to double check if everything is configured as per the doc.
Hi, I filed https://github.com/cloud-bulldozer/benchmark-wrapper/pull/322 to help debug the JSON parsing issues. As soon as it gets merged, would you mind running the workload again with debug: true added in the Benchmark YAML?
@mukrishn I have added socket_memory for NUMA 1, then tried other parameters like:
memory_channels: 4
forwarding_cores: 4
rx_queues: 1
tx_queues: 1
rx_descriptors: 1024
tx_descriptors: 1024
forward_mode: "mac"
stats_period: 1
disable_rss: true
But the issue is still there.
@MuhammadMunir12 could you share the SRIOV and PAO policies you used?
@rsevilla87 Kindly verify whether I've added the debug flag in the right place. With this CR, the issue still persists.
$ cat dpdk-app.yaml
apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: testpmd-benchmark
  namespace: my-ripsaw
spec:
  clustername: r192b
  workload:
    name: testpmd
    debug: true
    args:
      socket_memory: 0,1024
      memory_channels: 4
      forwarding_cores: 4
      rx_queues: 1
      tx_queues: 1
      rx_descriptors: 1024
      tx_descriptors: 1024
      forward_mode: "mac"
      stats_period: 1
      disable_rss: true
      privileged: true
      pin: true
      pin_testpmd: "r192bc2.oss.labs"
      pin_trex: "r192bmw.oss.labs"
      networks:
        testpmd:
          - name: testpmd-sriov-network
            count: 2 # Interface count, Min 2
        trex:
          - name: testpmd-sriov-network
            count: 2
@mukrishn The DPDK node policy is:
$ cat intel-dpdk-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-dpdk-node-policy-for-testpmd
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  nicSelector:
    pfNames: ["ens3f1"]
    rootDevices: ["0000:d8:00.1"]
  deviceType: netdevice
  isRdma: false
The PAO manifest is:
$ cat pao.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-performance-addon-operator
  labels:
    openshift.io/run-level: "1"
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-performance-addon-operator
  namespace: openshift-performance-addon-operator
spec:
  targetNamespaces:
    - openshift-performance-addon-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-performance-addon-operator-subscription
  namespace: openshift-performance-addon-operator
spec:
  channel: "4.6"
  name: performance-addon-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
The performance profile is:
$ cat perf-profile.yaml
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: r192b-performanceprofile
spec:
  additionalKernelArgs:
    - nmi_watchdog=0
    - audit=0
    - mce=off
    - processor.max_cstate=1
    - idle=poll
    - intel_idle.max_cstate=0
  cpu:
    isolated: "2-35,38-71"
    reserved: "0,1,36,37"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - size: "1G"
        count: 16
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: "restricted"
    # topologyPolicy: "best-effort" # May change performance, but this can be used
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
The only strange thing is that I have used the "netdevice" driver with the Intel NICs because "vfio-pci" was not creating VFs.
@MuhammadMunir12, debug: true goes nested under args.
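(In other words, based on that comment, something like the following in the CR, keeping the rest of the args as they were:)
workload:
  name: testpmd
  args:
    debug: true
    # ...remaining testpmd args unchanged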
@rsevilla87 That doesn't change anything with the debug flag set to true.
@MuhammadMunir12 It has to be vfio-pci for Intel XXV710 cards; as you can see, the testpmd application tries to load VFIO modules. We have documented the SR-IOV configuration for different cards.
Also make sure you have hugepages allocated on NUMA socket 1 on the worker nodes: cat /sys/devices/system/node/node*/meminfo | fgrep Huge
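(For illustration, this is the node policy shared above with only the device type switched to what the SR-IOV doc prescribes for these Intel cards:)
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-dpdk-node-policy-for-testpmd
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  nicSelector:
    pfNames: ["ens3f1"]
    rootDevices: ["0000:d8:00.1"]
  deviceType: vfio-pci # was netdevice in the policy shared earlier
  isRdma: false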
@rsevilla87 That doesn't change anything with the debug flag set to true.
Hey, I just merged the debug patch, can you try again?
@mukrishn Using vfio-pci in the node policy doesn't create VFs on the given interface. And yes, hugepages are allocated on both NUMA sockets:
Node 0 HugePages_Total: 8
Node 0 HugePages_Free: 8
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 8
Node 1 HugePages_Free: 8
Node 1 HugePages_Surp: 0
@rsevilla87 Thank you. I have checked, but it's not working. As @mukrishn mentioned, for Intel the driver must be vfio-pci, but with that driver I'm unable to create VFs using the node policy; VFs are created only when I use the netdevice driver type, which is for Mellanox NICs. I don't understand why it's behaving like this.
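(One hypothetical thing worth checking on the worker is whether the vfio-pci module is loaded at all; if it isn't, binding the VFs to it will fail. For example, via oc debug:)
$ oc debug node/r192bc2.oss.labs -- chroot /host lsmod | grep vfio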
@MuhammadMunir12 Maybe you can try netdevice with isRdma: true.
@mukrishn That can be tried, but it's set to true for Mellanox NICs with DPDK.
@mukrishn That didn't work with netdevice and isRdma: true either.
@jtaleric Any updates at your end regarding this issue?
@jtaleric Can we see which DPDK version is used by the benchmark operator for running the DPDK apps?
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
I am testing the testpmd DPDK app as a benchmark on Intel NICs (XXV710) with an OpenShift 4.6 cluster. Upon running the benchmark as given below:
oc create -f dpdkapp.yaml
The benchmark gets created, but the testpmd pod keeps restarting, cycling between CrashLoopBackOff and Error states, while the trex pod runs for a few seconds and then goes into an error state. The logs for the trex pod are given above.
I have run testpmd and trex separately as pods using their base images and they work fine. But my use case is to test the DPDK app as a benchmark using the operator. Help from the community will be highly appreciated.