cloud-bulldozer / benchmark-operator

The Chuck Norris of cloud benchmarks
Apache License 2.0

Testpmd benchmark fails: Testpmd pod keeps crashing and trex pod runs for a few seconds then goes into an error state #640

Closed MuhammadMunir12 closed 2 years ago

MuhammadMunir12 commented 3 years ago

I am testing the testpmd DPDK app as a benchmark on Intel NICs (XXV710) with an OpenShift 4.6 cluster, running the benchmark as given below:

oc create -f dpdkapp.yaml

apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: testpmd-benchmark
  namespace: my-ripsaw
spec:
  clustername: mycluster
  workload:
    name: testpmd
    args:
      privileged: true
      pin: true
      pin_testpmd: "worker1"
      pin_trex: "worker1"
      networks:
        testpmd:
          - name: testpmd-sriov-network
            count: 2  # Interface count, Min 2
        trex:
          - name: testpmd-sriov-network
            count: 2

The benchmark gets created, but the testpmd pod keeps restarting, cycling through CrashLoopBackOff and Error states, while the trex pod runs for a few seconds and then goes into an Error state. The logs for the trex pod are given below:

$ oc logs -f trex-traffic-gen-pod-1760b18a-z54q9
2021-08-12T10:12:30Z - INFO     - MainProcess - run_snafu: logging level is INFO
2021-08-12T10:12:30Z - INFO     - MainProcess - _load_benchmarks: Successfully imported 1 benchmark modules: uperf
2021-08-12T10:12:30Z - INFO     - MainProcess - _load_benchmarks: Failed to import 0 benchmark modules:
2021-08-12T10:12:30Z - INFO     - MainProcess - run_snafu: Not connected to Elasticsearch
2021-08-12T10:12:30Z - INFO     - MainProcess - wrapper_factory: identified trex as the benchmark wrapper
2021-08-12T10:12:30Z - INFO     - MainProcess - trigger_trex: Starting TRex Traffic Generator..
Traceback (most recent call last):
  File "/usr/local/bin/run_snafu", line 33, in <module>
    sys.exit(load_entry_point('snafu', 'console_scripts', 'run_snafu')())
  File "/opt/snafu/snafu/run_snafu.py", line 166, in main
    for i in process_generator(index_args, parser):
  File "/opt/snafu/snafu/run_snafu.py", line 194, in process_generator
    for action, index in data_object.emit_actions():
  File "/opt/snafu/snafu/trex_wrapper/trigger_trex.py", line 68, in emit_actions
    documents = self._json_payload(stdout)
  File "/opt/snafu/snafu/trex_wrapper/trigger_trex.py", line 33, in _json_payload
    payload = json.loads(data)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I have run testpmd and trex separately as pods using their base images and they work fine, but my use case is to test the DPDK app as a benchmark using the operator. Help from the community would be highly appreciated.

jtaleric commented 3 years ago

I have asked @mukrishn to check this out, but looking at the log, it would seem the issue is that you didn't provide an Elasticsearch URL. If my assumption is correct, this would be a bug, as we shouldn't require an ES URL to be passed.

To verify this, you could try our development ES instance: https://search-perfscale-dev-chmf5l4sh66lvxbnadi4bznl3a.us-west-2.es.amazonaws.com:443

MuhammadMunir12 commented 3 years ago

@jtaleric That is not a requirement for running this as a benchmark, btw; maybe it's required for Uperf, but I'm not sure about it.

MuhammadMunir12 commented 3 years ago

@jtaleric I have tried running with your development ES instance but am facing the same issue. The details are given below:

apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: testpmd-benchmark
  namespace: my-ripsaw
spec:
  elasticsearch:
    url: "https://search-perfscale-dev-chmf5l4sh66lvxbnadi4bznl3a.us-west-2.es.amazonaws.com:443"
  workload:
    name: testpmd
    args:
      privileged: true
      pin: true
      pin_testpmd: "worker1"
      pin_trex: "worker1"
      networks:
        testpmd:
          - name: testpmd-sriov-network
            count: 2  # Interface count, Min 2
        trex:
          - name: testpmd-sriov-network
            count: 2

Both pods (testpmd and trex) go into an Error state. Logs are attached below:

Testpmd pod

Events:
  Type     Reason          Age                            From               Message
  ----     ------          ----                           ----               -------
  Normal   Scheduled       <invalid>                      default-scheduler  Successfully assigned my-ripsaw/testpmd-application-pod-ae0f8c98-28wnk to r192bc2.oss.labs
  Normal   AddedInterface  <invalid>                      multus             Add eth0 [10.131.1.111/23]
  Normal   AddedInterface  <invalid>                      multus             Add net1 [10.57.1.123/24] from my-ripsaw/testpmd-sriov-network
  Normal   AddedInterface  <invalid>                      multus             Add net2 [10.57.1.124/24] from my-ripsaw/testpmd-sriov-network
  Normal   Pulled          <invalid>                      kubelet            Successfully pulled image "registry.redhat.io/openshift4/dpdk-base-rhel8:v4.6" in 1.099241488s
  Normal   Pulling         <invalid> (x2 over <invalid>)  kubelet            Pulling image "registry.redhat.io/openshift4/dpdk-base-rhel8:v4.6"
  Normal   Created         <invalid> (x2 over <invalid>)  kubelet            Created container testpmd
  Normal   Started         <invalid> (x2 over <invalid>)  kubelet            Started container testpmd
  Normal   Pulled          <invalid>                      kubelet            Successfully pulled image "registry.redhat.io/openshift4/dpdk-base-rhel8:v4.6" in 1.050182689s
  Warning  BackOff         <invalid> (x2 over <invalid>)  kubelet            Back-off restarting failed container

Trex pod

$ oc logs -f trex-traffic-gen-pod-ae0f8c98-7k8xm
2021-08-12T12:42:31Z - INFO     - MainProcess - run_snafu: logging level is INFO
2021-08-12T12:42:31Z - INFO     - MainProcess - _load_benchmarks: Successfully imported 1 benchmark modules: uperf
2021-08-12T12:42:31Z - INFO     - MainProcess - _load_benchmarks: Failed to import 0 benchmark modules:
2021-08-12T12:42:31Z - INFO     - MainProcess - run_snafu: Using elasticsearch server with host: https://search-perfscale-dev-chmf5l4sh66lvxbnadi4bznl3a.us-west-2.es.amazonaws.com:443
2021-08-12T12:42:31Z - INFO     - MainProcess - run_snafu: Using index prefix for ES: ripsaw-testpmd
2021-08-12T12:42:31Z - INFO     - MainProcess - run_snafu: Turning off TLS certificate verification
2021-08-12T12:42:31Z - INFO     - MainProcess - run_snafu: Connected to the elasticsearch cluster with info as follows:
2021-08-12T12:42:32Z - INFO     - MainProcess - run_snafu: {
    "name": "510fddd9ea3242aefad127567cffc68e",
    "cluster_name": "415909267177:perfscale-dev",
    "cluster_uuid": "Xz2IU4etSieAeaO2j-QCUw",
    "version": {
        "number": "7.10.2",
        "build_flavor": "oss",
        "build_type": "tar",
        "build_hash": "unknown",
        "build_date": "2021-05-21T20:25:46.519671Z",
        "build_snapshot": false,
        "lucene_version": "8.7.0",
        "minimum_wire_compatibility_version": "6.8.0",
        "minimum_index_compatibility_version": "6.0.0-beta1"
    },
    "tagline": "You Know, for Search"
}
2021-08-12T12:42:32Z - INFO     - MainProcess - py_es_bulk: Using streaming bulk indexer
2021-08-12T12:42:32Z - INFO     - MainProcess - wrapper_factory: identified trex as the benchmark wrapper
2021-08-12T12:42:32Z - INFO     - MainProcess - trigger_trex: Starting TRex Traffic Generator..
Traceback (most recent call last):
  File "/usr/local/bin/run_snafu", line 33, in <module>
    sys.exit(load_entry_point('snafu', 'console_scripts', 'run_snafu')())
  File "/opt/snafu/snafu/run_snafu.py", line 139, in main
    es, process_generator(index_args, parser), parallel_setting
  File "/opt/snafu/snafu/utils/py_es_bulk.py", line 171, in streaming_bulk
    for ok, resp_payload in streaming_bulk_generator:
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 320, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 155, in _chunk_actions
    for action, data in actions:
  File "/opt/snafu/snafu/utils/py_es_bulk.py", line 117, in actions_tracking_closure
    for cl_action in cl_actions:
  File "/opt/snafu/snafu/run_snafu.py", line 194, in process_generator
    for action, index in data_object.emit_actions():
  File "/opt/snafu/snafu/trex_wrapper/trigger_trex.py", line 68, in emit_actions
    documents = self._json_payload(stdout)
  File "/opt/snafu/snafu/trex_wrapper/trigger_trex.py", line 33, in _json_payload
    payload = json.loads(data)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

mukrishn commented 3 years ago

@MuhammadMunir12 You mentioned that your testpmd pod kept restarting and went into CrashLoopBackOff; that worries me, because if the testpmd pod isn't running, trex will definitely fail. Could you paste the error from the testpmd pod? oc logs testpmd-application-pod-ae0f8c98-28wnk

I am assuming the pods are pinned to different worker nodes, although the CR you shared pins both to the same one; just checking, because we have the pod anti-affinity check here
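
For example (hypothetical node names), pinning the two pods to two different workers would look like:

  workload:
    name: testpmd
    args:
      pin: true
      pin_testpmd: "worker1"  # testpmd pod scheduled here
      pin_trex: "worker2"     # trex pod on a different node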

MuhammadMunir12 commented 3 years ago

@mukrishn I have pinned them to different worker nodes this time, but the issue is still the same. Logs for the testpmd pod are attached. I am using the default values for testpmd as mentioned in the operator's guide.

$ oc logs -f testpmd-application-pod-f28e6493-m4t2t
EAL: Detected 72 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Auto-detected process type: PRIMARY
EAL: Multi-process socket /tmp/dpdk/pg/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:d8:0a.2 on NUMA socket 1
EAL:   probe driver: 8086:154c net_i40e_vf
EAL: PCI device 0000:d8:0a.3 on NUMA socket 1
EAL:   probe driver: 8086:154c net_i40e_vf
testpmd: No probed ethernet devices
Fail: input rxq (1) can't be greater than max_rx_queues (0) of port 0
EAL: Error - exiting with code: 1
  Cause: rxq 1 invalid - must be >= 0 && <= 0

jtaleric commented 3 years ago

> @jtaleric That is not a requirement for running this as a benchmark, btw; maybe it's required for Uperf, but I'm not sure about it.

It shouldn't be a requirement; however, some things do slip through the cracks. Either way, it would be good to catch this error cleanly rather than just dropping a traceback.
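
Something like this in trigger_trex.py's _json_payload would be enough to fail gracefully (just a sketch of the idea, not the actual patch; the logger name is assumed):

try:
    payload = json.loads(data)
except json.JSONDecodeError:
    # surface the raw TRex output instead of an unhandled traceback
    logger.error("TRex produced non-JSON output: %r", data)
    raise SystemExit(1)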

mukrishn commented 3 years ago

@MuhammadMunir12 testpmd itself failed, and we need to fix that first. From the log, you are using interfaces on NUMA socket 1, and I believe the hugepages are created on the same socket as well, but you haven't assigned socket_memory for the testpmd app on the right socket. You can set it as part of the CR (by default the script assigns this to socket 0 only):

  workload:
    args:
      socket_memory: 0,1024 # in socket order; the default would be 1024,0

More config params are here.
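
For context, socket_memory presumably ends up as DPDK's EAL --socket-mem option, so with the value above testpmd reserves hugepage memory on socket 1 only. The launch line inside the pod would look roughly like this (a sketch; the operator builds the real command line from the CR args, and the core list is a placeholder):

$ testpmd -l <isolated-cores> -n 4 --socket-mem 0,1024 -- --nb-cores=4 --forward-mode=mac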

Also, I would like to see your SR-IOV policy and PAO profile, just to double-check that everything is configured as per the doc.

rsevilla87 commented 3 years ago

Hi, I filed https://github.com/cloud-bulldozer/benchmark-wrapper/pull/322 to help debug the JSON parsing issues. As soon as it gets merged, would you mind running the workload again with debug: true added to the Benchmark YAML?

MuhammadMunir12 commented 3 years ago

@mukrishn I have added socket_memory for NUMA socket 1, then tried other parameters like:

    memory_channels: 4
    forwarding_cores: 4
    rx_queues: 1
    tx_queues: 1
    rx_descriptors: 1024
    tx_descriptors: 1024
    forward_mode: "mac"
    stats_period: 1
    disable_rss: true

But the issue is still there.

mukrishn commented 3 years ago

@MuhammadMunir12 Could you share the SR-IOV and PAO policies you used?

MuhammadMunir12 commented 3 years ago

@rsevilla87 Kindly verify that I've added the debug flag in the right place. With this CR, the issue still persists.

$ cat dpdk-app.yaml
apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: testpmd-benchmark
  namespace: my-ripsaw
spec:
  clustername: r192b
  workload:
    name: testpmd
    debug: true
    args:
      socket_memory: 0,1024
      memory_channels: 4
      forwarding_cores: 4
      rx_queues: 1
      tx_queues: 1
      rx_descriptors: 1024
      tx_descriptors: 1024
      forward_mode: "mac"
      stats_period: 1
      disable_rss: true
      privileged: true
      pin: true
      pin_testpmd: "r192bc2.oss.labs"
      pin_trex: "r192bmw.oss.labs"
      networks:
        testpmd:
          - name: testpmd-sriov-network
            count: 2  # Interface count, Min 2
        trex:
          - name: testpmd-sriov-network
            count: 2

MuhammadMunir12 commented 3 years ago

@mukrishn The DPDK node policy is:

$ cat intel-dpdk-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-dpdk-node-policy-for-testpmd
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  nicSelector:
    pfNames: ["ens3f1"]
    rootDevices: ["0000:d8:00.1"]
  deviceType: netdevice
  isRdma: false

The PAO manifest is:

$ cat pao.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-performance-addon-operator
  labels:
    openshift.io/run-level: "1"
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-performance-addon-operator
  namespace: openshift-performance-addon-operator
spec:
  targetNamespaces:
  - openshift-performance-addon-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-performance-addon-operator-subscription
  namespace: openshift-performance-addon-operator
spec:
  channel: "4.6"
  name: performance-addon-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace

The performance profile is:

$ cat perf-profile.yaml
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: r192b-performanceprofile
spec:
  additionalKernelArgs:
    - nmi_watchdog=0
    - audit=0
    - mce=off
    - processor.max_cstate=1
    - idle=poll
    - intel_idle.max_cstate=0
  cpu:
    isolated: "2-35,38-71"
    reserved: "0,1,36,37"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - size: "1G"
      count: 16
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: "restricted"
    # topologyPolicy: "best-effort" # May change performance, but this can be used
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

The only strange thing is that I have used the "netdevice" device type with the Intel NICs, because "vfio-pci" was not creating VFs.

rsevilla87 commented 3 years ago

@MuhammadMunir12, debug: true is nested under args.
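
i.e., moving the flag down one level:

  workload:
    name: testpmd
    args:
      debug: true    # here, not at the workload level
      socket_memory: 0,1024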

MuhammadMunir12 commented 3 years ago

@rsevilla87 That doesn't change anything with the debug flag set to true.

mukrishn commented 3 years ago

@MuhammadMunir12 It has to be vfio-pci for Intel XXV710 cards; as you can see, the testpmd application tries to load the VFIO modules. We have documented the SR-IOV configuration for different cards.
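
Based on the node policy you shared, the change would just be the device type; a sketch:

spec:
  resourceName: intelnics
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  nicSelector:
    pfNames: ["ens3f1"]
    rootDevices: ["0000:d8:00.1"]
  deviceType: vfio-pci  # changed from netdevice for Intel XXV710 with DPDK
  isRdma: false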

Also make sure you have hugepages allocated on NUMA socket 1 on the worker nodes: cat /sys/devices/system/node/node*/meminfo | fgrep Huge
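
You can also check what the SR-IOV operator actually applied on the node, per-interface VF counts and bound drivers, with (assuming the default operator namespace):

$ oc -n openshift-sriov-network-operator get sriovnetworknodestates -o yaml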

rsevilla87 commented 3 years ago

> @rsevilla87 That doesn't change anything with the debug flag set to true.

Hey, I just merged the debug patch, can you try again?

MuhammadMunir12 commented 3 years ago

@mukrishn Using vfio-pci in the node policy doesn't create VFs on the given interface. And yes, hugepages are allocated on both NUMA sockets.

Node 0 HugePages_Total:     8
Node 0 HugePages_Free:      8
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     8
Node 1 HugePages_Free:      8
Node 1 HugePages_Surp:      0

MuhammadMunir12 commented 3 years ago

@rsevilla87 Thank you. I have checked, but it's not working. As @mukrishn mentioned, for Intel the driver must be vfio-pci, but with that driver I'm unable to create VFs using the node policy; VFs are created only when I use the netdevice device type, which is meant for Mellanox NICs. I don't understand why it's behaving like this.

mukrishn commented 3 years ago

@MuhammadMunir12 Maybe you can try netdevice with isRdma: true.
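
i.e., in the SriovNetworkNodePolicy spec:

  deviceType: netdevice
  isRdma: true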

MuhammadMunir12 commented 3 years ago

@mukrishn That can be tried, but isRdma: true is normally used with Mellanox NICs for DPDK.

MuhammadMunir12 commented 3 years ago

@mukrishn Didn't work with netdevice and isRdma: true.

@jtaleric Any updates at your end regarding this issue?

MuhammadMunir12 commented 3 years ago

@jtaleric Can we see which DPDK version is used by the benchmark-operator for running the DPDK apps?

mukrishn commented 3 years ago

@MuhammadMunir12 The testpmd pod image is registry.redhat.io/openshift4/dpdk-base-rhel8:v4.6, which runs dpdk-19.11-5.el8_2.x86_64; you can find the packages here. You can also use your own image - ref
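
If you want to confirm from inside a running pod, something like this should work (the image is RHEL-based, so rpm is available; the pod name is a placeholder):

$ oc exec testpmd-application-pod-<id> -- rpm -qa | grep dpdk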

stale[bot] commented 3 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.