CloudNativeDataPlane / cndp

Cloud Native Data Plane (CNDP) is a collection of user space libraries to accelerate packet processing for cloud applications, using AF_XDP sockets as the primary I/O.
BSD 3-Clause "New" or "Revised" License

CNDP performance drops when both ingress and egress traffic are present #339

Closed Lakshyagour closed 1 year ago

Lakshyagour commented 1 year ago

I have posted a question on Stack Overflow for which I need the answer.

https://stackoverflow.com/questions/76991150/port-rx-drop-in-intel-corporation-ethernet-controller-xl710-for-40gbe-qsfp?noredirect=1#comment135735741_76991150

I created a CNDP forwarding application and tested it on a 40 Gbps NIC. I found that performance decreases when there is both ingress and egress traffic on the NIC.

With only ingress traffic on the NIC I am able to get 40 Gbps, but when there is both ingress and egress traffic I get at most 23 Gbps. I don't know why it drops so much.

"umem0": { "bufcnt": 400, "bufsz": 2, "mtype": "2MB", "regions": [ 400 ], "rxdesc": 4, "txdesc": 4, "cache": 256, "description": "UMEM Description 0" },

This is my umem configuration.

Any insights will be very helpful.

KeithWiles commented 1 year ago

The umem section looks fine, but I see 400 regions, which is very odd; maybe it is a typo. Could you please paste the complete jsonc file? If you use the '<>' button (add code) to insert the text it will make it a bit easier to read.

Also please post the CPU layout and the CPU type. The cndp/tools/cpu_layout.py script will display the layout of the cores on your system.

You can get the CPU type from cat /proc/cpuinfo; only the CPU model name is needed, or you can paste one core's info.

Lakshyagour commented 1 year ago

Please ignore the application section; it is specific to my application.

{

"application": {
    "name": "cndpfwd",
    "description": "A simple packet forwarder for pktdev and xskdev",
    "ran-ip": "192.168.2.1",
    "ran-mac": "3c:fd:fe:9e:7e:34",
    "dnn-ip": "192.168.1.1",
    "dnn-mac": "3c:fd:fe:9e:7b:85",
    "upf-ran-mac": "3c:fd:fe:9e:7b:5c",  // the interface attachd to ran
    "upf-dnn-mac": "3c:fd:fe:9e:7b:5d",  // the interface attached to dnn
    "arp-reply-ran-interface": "ens259f0",
    "arp-reply-dnn-interface": "ens259f1"
},

"defaults": {
    "bufcnt": 16,
    "bufsz": 2,
    "rxdesc": 2,
    "txdesc": 2,
    "cache": 256,
    "mtype": "2MB"
},

"umems": {
    "umem0": {
        "bufcnt": 400,
        "bufsz": 2,
        "mtype": "2MB",
        "regions": [
            400
        ],
        "rxdesc": 4,
        "txdesc": 4,
        "cache": 256,
        "description": "UMEM Description 0"
    },
    "umem1": {
        "bufcnt": 400,
        "bufsz": 2,
        "mtype": "2MB",
        "regions": [
            400
        ],
        "rxdesc": 4,
        "txdesc": 4,
        "cache": 256,
        "description": "UMEM Description 0"
    }
},

"lports": {
    "ens259f0:0": {
        "pmd": "net_af_xdp",
        "qid": 0,
        "umem": "umem0",
        "region": 0,
        "description": "UPLINK port 0"
    },
    "ens259f0:1": {
        "pmd": "net_af_xdp",
        "qid": 1,
        "umem": "umem1",
        "region": 0,
        "description": "UPLINK port 0"
    },
    "ens259f1:0": {
        "pmd": "net_af_xdp",
        "qid": 0,
        "umem": "umem0",
        "region": 0,
        "description": "DOWNLINK port 0"
    },
    "ens259f1:1": {
        "pmd": "net_af_xdp",
        "qid": 1,
        "umem": "umem1",
        "region": 0,
        "description": "DOWNLINK port 0"
    },
},
// (O) Define the lcore groups for each thread to run
//     Can be integers or a string for a range of lcores
//     e.g. [10], [10-14,16], [10-12, 14-15, 17-18, 20]
// Names of a lcore group and its lcores assigned to the group.
// The initial group is for the main thread of the application.
// The default group is special and is used if a thread is not assigned to a group.
"lcore-groups": {
    "initial": [
        40
    ],
    "group0": [
        "41-43"
    ],
    "group1": [
        "44-46"
    ],
    "default": [
        "40-46"
    ]
},
// (O) Set of common options application defined.
//     The Key can be any string and value can be boolean, string, array or integer
//     An array must contain only a single value type, boolean, integer, string and
//     can't be a nested array.
//   pkt_api    - (O) Set the type of packet API xskdev or pktdev
//   no-metrics - (O) Disable metrics gathering and thread
//   no-restapi - (O) Disable RestAPI support
//   cli        - (O) Enable/Disable CLI supported
//   mode       - (O) Mode type [drop | rx-only], tx-only, [lb | loopback], fwd, tx-only-rx,
//                    acl-strict, acl-permissive, [hyperscan | hs]
//   uds_path   - (O) Path to unix domain socket to get xsk map fd
//   filename   - (O) path of filename to load
//   progname   - (O) function name in filename; requires filename to be present
"options": {
    "pkt_api": "xskdev",
    "mode": "fwd",
    "cli": true
},
// List of threads to start and information for that thread. Application can start
// it's own threads for any reason and are not required to be configured by this file.
//
//   Key/Val   - (R) A unique thread name.
//                   The format is <type-string>[:<identifier>] the ':' and identifier
//                   are optional if all thread names are unique
//      group  - (O) The lcore-group this thread belongs to. The
//      lports - (O) The list of lports assigned to this thread; threads cannot share lports.
//      idle_timeout - (O) if non-zero use value is in milliseconds to detect idle state
//      intr_timeout - (O) number of milliseconds to wait on interrupt
//      description | desc - (O) The description
"threads": {
    "main": {
        "group": "initial",
        "description": "CLI Thread"
    },
    "fwd:0": {
        "group": "group0",
        "lports": [
            "ens259f0:0"
        ],
        "description": "UPLINK Thread 0"
    },
    "fwd:1": {
        "group": "group0",
        "lports": [
            "ens259f0:1"
        ],
        "description": "UPLINK Thread 1"
    },
    "fwd:3": {
        "group": "group1",
        "lports": [
            "ens259f1:0"
        ],
        "description": "DOWNLINK Thread 0"
    },
    "fwd:4": {
        "group": "group1",
        "lports": [
            "ens259f1:1"
        ],
        "description": "DOWNLINK Thread 1"
    },
}

}

vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
stepping : 1
microcode : 0xb000040
cpu MHz : 1197.349
cache size : 30720 KB
physical id : 1

cores = [0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13]
sockets = [0, 1]

         Socket 0        Socket 1
         --------        --------
Core 0   [0, 24]         [12, 36]
Core 1   [1, 25]         [13, 37]
Core 2   [2, 26]         [14, 38]
Core 3   [3, 27]         [15, 39]
Core 4   [4, 28]         [16, 40]
Core 5   [5, 29]         [17, 41]
Core 8   [6, 30]         [18, 42]
Core 9   [7, 31]         [19, 43]
Core 10  [8, 32]         [20, 44]
Core 11  [9, 33]         [21, 45]
Core 12  [10, 34]        [22, 46]
Core 13  [11, 35]        [23, 47]

Lakshyagour commented 1 year ago

I don't completely understand what port.rx drop signifies. I have faced the same drop in the past when I wrote the application in XDP; there the bottleneck was also 23 Gbps. At that time I assumed bpf_redirect was the issue, because when running the application with XDP_TX I was able to get line rate.
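For reference, a minimal sketch of the two XDP paths being compared here (this is not the actual application code; out_ifindex is a hypothetical value patched in at load time):

/* Minimal XDP sketch contrasting the two paths discussed above.
 * XDP_TX bounces the frame back out the interface it arrived on, while
 * bpf_redirect() sends it out another interface, which is the heavier path.
 * Requires linux/bpf.h and libbpf's bpf_helpers.h. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical: the egress ifindex, set by the loader; 0 means "echo back". */
volatile const __u32 out_ifindex = 0;

SEC("xdp")
int xdp_echo_or_redirect(struct xdp_md *ctx)
{
    if (out_ifindex == 0)
        return XDP_TX;                   /* send back on the same port */
    return bpf_redirect(out_ifindex, 0); /* forward to the other port */
}

char _license[] SEC("license") = "GPL";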

KeithWiles commented 1 year ago

I tried to add comments to the JSONC file; please read them inline.

"application": {
    "name": "cndpfwd",
    "description": "A simple packet forwarder for pktdev and xskdev",
    "ran-ip": "192.168.2.1",
    "ran-mac": "3c:fd:fe:9e:7e:34",
    "dnn-ip": "192.168.1.1",
    "dnn-mac": "3c:fd:fe:9e:7b:85",
    "upf-ran-mac": "3c:fd:fe:9e:7b:5c",  // the interface attachd to ran
    "upf-dnn-mac": "3c:fd:fe:9e:7b:5d",  // the interface attached to dnn
    "arp-reply-ran-interface": "ens259f0",
    "arp-reply-dnn-interface": "ens259f1"
},

"defaults": {
    "bufcnt": 16,
    "bufsz": 2,
    "rxdesc": 2,
    "txdesc": 2,
    "cache": 256,
    "mtype": "2MB"
},

// Note:
//   Converted the two UMEMs to a single UMEM with 4 regions of 100K buffers per region
//   The reason is that if you are receiving packets on one umem and sending them to another umem
//   the packet must be copied between the two umem spaces, which is a performance hit.
//
//   One reason for having two umems is the application wants to keep packet data isolated
//   to a given umem for security reasons. But if packets are sent from one umem to another
//   umem a copy must take place and performance will drop.
"umems": {
    "umem0": {
        "bufcnt": 400,
        "bufsz": 2,
        "mtype": "2MB",
        "regions": [
            100,
            100,
            100,
            100
        ],
        "rxdesc": 4,
        "txdesc": 4,
        "cache": 256,
        "description": "UMEM Description 0"
    }
},

// Note:
//   Changed to attach each port to a different regions 0-3 in the same umem0 space.
"lports": {
    "ens259f0:0": {
        "pmd": "net_af_xdp",
        "qid": 0,
        "umem": "umem0",
        "region": 0,
        "description": "UPLINK port 0"
    },
    "ens259f0:1": {
        "pmd": "net_af_xdp",
        "qid": 1,
        "umem": "umem0",
        "region": 1,
        "description": "UPLINK port 1"
    },
    "ens259f1:0": {
        "pmd": "net_af_xdp",
        "qid": 0,
        "umem": "umem0",
        "region": 2,
        "description": "DOWNLINK port 0"
    },
    "ens259f1:1": {
        "pmd": "net_af_xdp",
        "qid": 1,
        "umem": "umem0",
        "region": 3,
        "description": "DOWNLINK port 1"
    },
},
// (O) Define the lcore groups for each thread to run
//     Can be integers or a string for a range of lcores
//     e.g. [10], [10-14,16], [10-12, 14-15, 17-18, 20]
// Names of a lcore group and its lcores assigned to the group.
// The initial group is for the main thread of the application.
// The default group is special and is used if a thread is not assigned to a group.
"lcore-groups": {
    // Changed the lcore numbers to use the Hyper-Thread-0 on each core as that is my preference.
    // The lcores used here are on socket 1 and the NIC needs to be on a PCI bus attached to
    // socket 1 or NUMA 1 region for performance.
    //
    // Normally a PCI address starting with values between 00-7f are on socket 0 and 80-FF are
    // on socket 1. 01:00.0 Socket 0, 82:00.0 Socket 1. If the NIC and lcores used are on
    // different sockets a performance hit will occur as the CPU must access the other socket
    // across the socket-to-socket internal bus QPI.
    "initial": [
        12
    ],
    // Using 1 core per group, using 13-15 for group0 would mean the thread can run on any
    // lcore 13-15. I normally assign a single lcore per thread as sharing lcores between
    // threads can have a performance hit. When you assign the same group with multiple lcores
    // to multiple threads Linux will try to load balance these threads between the lcores.
    //
    // Normally because a thread is constantly running it will never leave an lcore and if more
    // than one of these threads is assigned to the same lcore only one of the threads will
    // run on that lcore due to Linux scheduling. In some cases Linux could try to run the other
    // thread on the same lcore, but it requires the first thread to yield its time.
    // 
    // If you need more lcores per group think about moving one NIC to the other PCI bus
    // attached to the other socket and then adjust the lcores. This can cause a performance
    // hit as it will need two UMEMs (one per socket) and if packets are passing between
    // the two NICs.
    //
    // The number of threads defined in this file is 4 forwarding threads, which means we only
    // need 4 lcores or 4 groups with a single lcore. 
    "group0": [
        "13"
    ],
    "group1": [
        "14"
    ],
    "group0": [
        "15"
    ],
    "group1": [
        "16"
    ],
    "default": [
        "13-16"
    ]
},
// (O) Set of common options application defined.
//     The Key can be any string and value can be boolean, string, array or integer
//     An array must contain only a single value type, boolean, integer, string and
//     can't be a nested array.
//   pkt_api    - (O) Set the type of packet API xskdev or pktdev
//   no-metrics - (O) Disable metrics gathering and thread
//   no-restapi - (O) Disable RestAPI support
//   cli        - (O) Enable/Disable CLI supported
//   mode       - (O) Mode type [drop | rx-only], tx-only, [lb | loopback], fwd, tx-only-rx,
//                    acl-strict, acl-permissive, [hyperscan | hs]
//   uds_path   - (O) Path to unix domain socket to get xsk map fd
//   filename   - (O) path of filename to load
//   progname   - (O) function name in filename; requires filename to be present
"options": {
    "pkt_api": "xskdev",
    "mode": "fwd",
    "cli": true
},
// List of threads to start and information for that thread. Application can start
// it's own threads for any reason and are not required to be configured by this file.
//
//   Key/Val   - (R) A unique thread name.
//                   The format is <type-string>[:<identifier>] the ':' and identifier
//                   are optional if all thread names are unique
//      group  - (O) The lcore-group this thread belongs to. The
//      lports - (O) The list of lports assigned to this thread; threads cannot share lports.
//      idle_timeout - (O) if non-zero use value is in milliseconds to detect idle state
//      intr_timeout - (O) number of milliseconds to wait on interrupt
//      description | desc - (O) The description
"threads": {
    "main": {
        "group": "initial",
        "description": "CLI Thread"
    },

    // Changed the threads to use different groups for performance reasons. You normally
    // do not want two threads trying to use the same lcore. In your case you assigned two
    // threads to the same group, which kind of works. Linux will schedule the two on one
    // of the lcores in the group (you have 3 lcores) and the third lcore would be unused.
    //
    // If you used the idle mode idle_timeout and intr_timeout it would allow more than one
    // thread to share a single lcore or group of lcores. If the number of threads needing to
    // run at the same time exceeds the number of lcores then those threads would not run until
    // the Linux scheduler determines a thread has given up its slice of time.
    "fwd:0": {
        "group": "group0",
        "lports": [
            "ens259f0:0"
        ],
        "description": "UPLINK Thread 0"
    },
    "fwd:1": {
        "group": "group1",
        "lports": [
            "ens259f0:1"
        ],
        "description": "UPLINK Thread 1"
    },
    "fwd:3": {
        "group": "group2",
        "lports": [
            "ens259f1:0"
        ],
        "description": "DOWNLINK Thread 0"
    },
    "fwd:4": {
        "group": "group3",
        "lports": [
            "ens259f1:1"
        ],
        "description": "DOWNLINK Thread 1"
    },
}
Lakshyagour commented 1 year ago

Hi Keith,

I tried with the updated configuration but didn't find any performance boost.

My setup is like this

For incoming data on ens259f0:
a packet received on ens259f0:0 is processed by the application and forwarded to ens259f1:0;
a packet received on ens259f0:1 is processed by the application and forwarded to ens259f1:1.

For incoming data on ens259f1:
a packet received on ens259f1:0 is processed by the application and forwarded to ens259f0:0;
a packet received on ens259f1:1 is processed by the application and forwarded to ens259f0:1.

Server 1 is the load generator, server 3 is the echo server, and server 2 is the application server. The full packet path is: server1 -> server2 (ens259f0) -> server2 (ens259f1) -> server3 -> server2 (ens259f1) -> server2 (ens259f0) -> server1. Screenshot 2023-08-30 120717

Load test results With only ingress traffic Screenshot 2023-08-30 115957

With both ingress and egress traffic Screenshot 2023-08-30 115735

Ethtool stats

Screenshot 2023-08-30 121258

With loopback mode, i.e. server 2 acts as the echo server but does the complete application processing plus the echo server processing. Screenshot 2023-08-30 120336

Updated configuration file:

{

"application": {
    "name": "cndpfwd",
    "description": "A simple packet forwarder for pktdev and xskdev",
    "ran-ip": "192.168.2.1",
    "ran-mac": "3c:fd:fe:9e:7e:34",
    "dnn-ip": "192.168.1.1",
    "dnn-mac": "3c:fd:fe:9e:7b:85",
    "upf-ran-mac": "3c:fd:fe:9e:7b:5c",  // the interface attachd to ran
    "upf-dnn-mac": "3c:fd:fe:9e:7b:5d",  // the interface attached to dnn
    "arp-reply-ran-interface": "ens259f0",
    "arp-reply-dnn-interface": "ens259f1"
},

"defaults": {
    "bufcnt": 16,
    "bufsz": 2,
    "rxdesc": 2,
    "txdesc": 2,
    "cache": 256,
    "mtype": "2MB"
},

"umems": {
    "umem0": {
        "bufcnt": 512,
        "bufsz": 2,
        "mtype": "2MB",
        "regions": [
            128,
            128,
            128,
            128
        ],
        "rxdesc": 4,
        "txdesc": 4,
        "cache": 256,
        "description": "UMEM Description 0"
    }
},

"lports": {
    "ens259f0:0": {
        "pmd": "net_af_xdp",
        "qid": 0,
        "umem": "umem0",
        "region": 0,
        "description": "UPLINK port 0"
    },
    "ens259f0:1": {
        "pmd": "net_af_xdp",
        "qid": 1,
        "umem": "umem0",
        "region": 1,
        "description": "UPLINK port 0"
    },
    "ens259f1:0": {
        "pmd": "net_af_xdp",
        "qid": 0,
        "umem": "umem0",
        "region": 2,
        "description": "DOWNLINK port 0"
    },
    "ens259f1:1": {
        "pmd": "net_af_xdp",
        "qid": 1,
        "umem": "umem0",
        "region": 3,
        "description": "DOWNLINK port 0"
    },
},

"lcore-groups": {
    "initial": [
        10
    ],
    "group0": [
        11
    ],
    "group1": [
        12
    ],
    "group2": [
        13
    ],
    "group3": [
        14
    ],
    "default": [
        "10-16"
    ]
},

"options": {
    "pkt_api": "xskdev",
    "mode": "fwd",
    "cli": true
},

"threads": {
    "main": {
        "group": "initial",
        "description": "CLI Thread"
    },
    "fwd:0": {
        "group": "group0",
        "lports": [
            "ens259f0:0"
        ],
        "description": "UPLINK Thread 0"
    },
    "fwd:1": {
        "group": "group1",
        "lports": [
            "ens259f0:1"
        ],
        "description": "UPLINK Thread 1"
    },
    "fwd:3": {
        "group": "group2",
        "lports": [
            "ens259f1:0"
        ],
        "description": "DOWNLINK Thread 0"
    },
    "fwd:4": {
        "group": "group3",
        "lports": [
            "ens259f1:1"
        ],
        "description": "DOWNLINK Thread 1"
    },
}

}

Lakshyagour commented 1 year ago

What is the difference between loopback mode and fwd mode? There is a very large performance difference, nearly 2x.

From what I understood, in forward mode we call (void)txbuff_add(txbuff[dst->lpid], pd->rx_mbufs[j]); which adds the packet descriptor to a tx buffer, and when enough packets have been collected to transmit we call txbuff_flush.

txbuff_flush internally uses xskdev_tx_burst(buffer->info, (void **)buffer->pkts, npkts);

In loopback we simply do __tx_flush(pd, fwd->pkt_api, pd->rx_mbufs, n_pkts); which also internally calls xskdev_tx_burst(buffer->info, (void **)buffer->pkts, npkts);

So why is there such a large performance difference between the two modes?

The code I have written for forward mode is similar to the example:

static int
_fwd_test(jcfg_lport_t *lport, struct fwd_info *fwd)
{
    struct fwd_port *pd = (struct fwd_port *)lport->priv_;
    struct create_txbuff_thd_priv_t *thd_private;
    txbuff_t **txbuff;
    int n_pkts;

    if (!pd)
        CNE_ERR_RET("fwd_port passed in lport private data is NULL\n");

    thd_private = (struct create_txbuff_thd_priv_t *)pd->thd->priv_;
    txbuff      = thd_private->txbuffs;

    n_pkts = __rx_burst(fwd->pkt_api, pd, pd->rx_mbufs, fwd->burst);
    if (n_pkts == PKTDEV_ADMIN_STATE_DOWN)
        return -1;

    jcfg_data_t *data = (jcfg_data_t *)fwd->jinfo->cfg;
    jcfg_list_t *lst  = &data->lport_list;

    int lport_count   = lst->sz;
    int lport_index   = lport->lpid;
    jcfg_lport_t *dst = lport;

    for (int j = 0; j < n_pkts; j++) {
        struct cne_net_hdr_lens hdr_lens = {};

        uint32_t packet_type = cne_get_ptype_custom(pd->rx_mbufs[j], &hdr_lens);
        switch (packet_type) {
        /* Process the downlink packet */
        case CNE_PTYPE_L4_UDP: {
            int dst_lport_index = lport_index - lport_count / 2;

            dst = (jcfg_lport_t *)(lst->list[dst_lport_index]);
            if (!dst)
                /* Cannot forward to a non-existing port, so echo back on the incoming interface */
                dst = lport;

            /* ... do uplink application processing ... */

            (void)txbuff_add(txbuff[dst->lpid], pd->rx_mbufs[j]);
            break;
        }
        /* Process the uplink packet */
        case CNE_PTYPE_TUNNEL_GTPU: {
            int dst_lport_index = lport_index + lport_count / 2;

            dst = (jcfg_lport_t *)(lst->list[dst_lport_index]);
            if (!dst)
                /* Cannot forward to a non-existing port, so echo back on the incoming interface */
                dst = lport;

            /* ... do downlink application processing ... */

            (void)txbuff_add(txbuff[dst->lpid], pd->rx_mbufs[j]);
            break;
        }
        }
    }

    while (txbuff_count(txbuff[dst->lpid]) > 0)
        txbuff_flush(txbuff[dst->lpid]);

    return n_pkts;
}

static int
_loopback_test(jcfg_lport_t *lport, struct fwd_info *fwd)
{
    struct fwd_port *pd = (struct fwd_port *)lport->priv_;
    int n_pkts, n;

    if (!pd)
        CNE_ERR_RET("fwd_port passed in lport private data is NULL\n");

    n_pkts = __rx_burst(fwd->pkt_api, pd, pd->rx_mbufs, fwd->burst);
    if (n_pkts == PKTDEV_ADMIN_STATE_DOWN)
        return -1;

    if (n_pkts) {
        for (int j = 0; j < n_pkts; j++) {
            struct cne_net_hdr_lens hdr_lens = {};

            uint32_t packet_type = cne_get_ptype_custom(pd->rx_mbufs[j], &hdr_lens);
            switch (packet_type) {
            /* Process the downlink packet */
            case CNE_PTYPE_L4_UDP: {
                /* ... do uplink application processing ... */
                break;
            }
            /* Process the uplink packet */
            case CNE_PTYPE_TUNNEL_GTPU: {
                /* ... do uplink application processing ... */
                break;
            }
            }
        }

        n = __tx_flush(pd, fwd->pkt_api, pd->rx_mbufs, n_pkts);
        if (n == PKTDEV_ADMIN_STATE_DOWN)
            return -1;
        pd->tx_overrun += n;
    }
    return n_pkts;
}

As per the performance numbers, if fwd mode is used like loopback mode then there is no performance difference. Here is a screenshot of fwd mode forwarding from ens259f0:0 --> ens259f0:1. Screenshot 2023-08-30 134153

Loopback performance is the same as this.

But when I do fwd from ens259f0:0 --> ens259f1:0 the performance is impacted severely.

My NIC information is listed below:
07:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
07:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)

maryamtahhan commented 1 year ago

What is the difference between loopback mode and fwd mode? There is a very large performance difference, nearly 2x.

Loopback mode loops the packet back on the same port and umem.

Fwd mode forwards the packet between 2 ports. If the UMEM is not shared between the ports, there is a copy that needs to happen from one umem to another, which has a performance impact.
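A rough illustration of that copy cost (a sketch only, not the CNDP implementation; the pool/packet types and helpers below are made up for the example):

/* Sketch: forwarding within one UMEM can hand the same buffer to TX (zero copy),
 * while forwarding between two UMEMs needs a buffer from the destination UMEM
 * plus a memcpy of the payload. All names here are illustrative, not CNDP APIs. */
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE 2048
#define POOL_FRAMES 64

struct pool {                 /* stand-in for a UMEM region */
    uint8_t frames[POOL_FRAMES][FRAME_SIZE];
    int     next;
};

struct pkt {
    struct pool *pool;        /* UMEM the buffer lives in */
    uint8_t     *data;
    uint32_t     len;
};

static uint8_t *pool_alloc(struct pool *p)
{
    return p->frames[p->next++ % POOL_FRAMES];
}

/* Returns the buffer that would be placed on the TX ring of the destination port. */
static struct pkt forward(struct pkt rx, struct pool *dst_pool)
{
    if (rx.pool == dst_pool)
        return rx;                                /* same UMEM: no copy needed */

    struct pkt tx = { dst_pool, pool_alloc(dst_pool), rx.len };
    memcpy(tx.data, rx.data, rx.len);             /* cross-UMEM: copy the payload */
    return tx;
}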

maryamtahhan commented 1 year ago

Also, looking at your NIC info, they are attached to NUMA node 0, so please make sure you are using socket 0 cores for your application.

KeithWiles commented 1 year ago

Maryam is correct here; the lcores used for a NIC must be on the same NUMA zone or socket. In your case the NICs are on Socket 0 or NUMA 0, and this is where your performance hit is taking place. I did state this in the config file I crafted; you must have missed those comments.

Lakshyagour commented 1 year ago

Hi Keith, I did take care of the NUMA zone in my new forwarding file as per your suggestion, but the performance didn't increase.

Screenshot 2023-08-31 000740 Screenshot 2023-08-31 000811

Updated fwd.jsonc


"lcore-groups": {
    "initial": [
        40
    ],
    "group0": [
        36
    ],
    "group1": [
        37
    ],
    "group2": [
        38
    ],
    "group3": [
        39
    ],
    "default": [
        "36-40"
    ]
},


Screenshot 2023-08-31 002048

` "lports": {

    "ens259f0:0": {
        "pmd": "net_af_xdp",
        "qid": 36,
        "umem": "umem0",
        "region": 0,
        "description": "UPLINK port 0"
    },
    "ens259f0:1": {
        "pmd": "net_af_xdp",
        "qid": 37,
        "umem": "umem0",
        "region": 1,
        "description": "UPLINK port 0"
    },
    "ens259f1:0": {
        "pmd": "net_af_xdp",
        "qid": 38,
        "umem": "umem0",
        "region": 2,
        "description": "DOWNLINK port 0"
    },
    "ens259f1:1": {
        "pmd": "net_af_xdp",
        "qid": 39,
        "umem": "umem0",
        "region": 3,
        "description": "DOWNLINK port 0"
    },
},


But it didn't have any effect on the application performance. Increasing the number of cores also does not have any effect: the total number of packets processed when using 2 cores is 20 million pps, and when using 3 cores it is also 20 million pps. This is with a 1400 B packet size. Screenshot 2023-08-31 005551

However, for small packet sizes I am able to get more pps, so definitely memory is the bottleneck somewhere.

Screenshot 2023-08-31 004947

But my confusion is that the application works fine in loopback mode, yet when using fwd mode there is a performance drop. Screenshot 2023-08-31 005435

I must be doing something wrong but couldn't find what. All the tests lead to the conclusion that I should use loopback mode, but I am not sure why.

Thank you for all the help till now ❤️

KeithWiles commented 1 year ago
My NIC information is listed below
07:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
07:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)

In the text above the PCI address starts with 07, which normally denotes Socket 0 and NUMA 0. The cat of numa_node states 1, which is odd.

Did you try it on socket 0 lcores 2-11? One more thing I noticed is that you have the threads in lcore-group:group[0-3] on lcores 36-39, but the umem:qid values are also mapped to the same lcores 36-39. Normally we make sure the qids and threads run on different lcores. The reason for this is that the kernel interrupt thread runs on the same lcore as the application thread, which means they must share that core.

Keep the application threads on 36-39 and move the QIDs to 41-44. Lcore 40 is still the main thread; if you are not doing any real work on 40, you can move it to socket 0, say lcore 2.

With the new configuration it has one UMEM space with 4 regions, which does not require a packet copy between ports on different UMEMs as it was originally.

The loopback mode receives the packet from a port, swaps the DST/SRC MAC addresses and sends the packet back to the same port. Forwarding mode (not to be confused with l3fwd mode) looks at DST MAC address octet 5, the least significant byte, and uses that value as the port id on which to send the packet. This means that byte must be between 0-3 in this configuration. The MAC swap is also performed on these packets, as the NIC will not send a packet whose DST MAC is the same as the MAC of the port sending it.
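A small sketch of the forwarding-mode behaviour described above (illustrative only; the struct and function names are not the cndpfwd code):

/* Illustrative only: pick the destination lport from DST MAC octet 5 and swap
 * the MAC addresses before transmit, as described above. */
#include <stdint.h>

struct ether_addr { uint8_t addr[6]; };
struct ether_hdr  { struct ether_addr dst, src; uint16_t ether_type; };

static int fwd_dest_port(struct ether_hdr *eth, int nb_ports)
{
    int dst_port = eth->dst.addr[5];          /* octet 5 is used as the lport id */

    if (dst_port >= nb_ports)
        return -1;                            /* not a valid port in this config (0-3) */

    /* Swap DST/SRC MACs; the NIC will not send a frame whose DST MAC is its own. */
    struct ether_addr tmp = eth->dst;
    eth->dst = eth->src;
    eth->src = tmp;

    return dst_port;
}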

Please verify the example cndpfwd in loopback and fwd mode gets the same results. It does mean the DST MAC needs to be correct for fwd mode to work.

I assume you have not changed the two MAC addresses on the send/receive machines. We use a traffic generator and we can set any DST/SRC MAC address, but it requires the NICs on the DUT to be in promiscuous mode.

For 1400 byte frames you could be getting close to wirerate and that is why it stops at 20 Mpps (I did not do the math). Make sure you are not hitting wirerate for the interface speed. For 64 byte frames it normally depends on the CPU. On my machine with a 40 Gbit NIC I get about 23 Mpps, as that is the max rate at which the CPU can receive and resend a packet, as long as no copy of the packet is needed and there is no other bottleneck in the system.
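For reference, the wirerate arithmetic (assuming the usual 20 bytes of preamble plus inter-frame gap per frame on the wire):

$$\text{max pps} = \frac{\text{link rate (bits/s)}}{(\text{frame size in bytes} + 20)\times 8}$$

which for a 40 Gbps link works out to roughly 59.5 Mpps with 64-byte frames and roughly 3.5 Mpps with 1400-byte frames.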

Also I noticed the last screen output shows 6 ports, not 4 as before; changing configurations is going to make this hard to debug when I expected only 4 ports.

KeithWiles commented 1 year ago

BTW, I use the lstopo command to graphically show which socket and which PCI bus my NICs are attached to. Some machines only have one PCI bus instead of one per socket, and that one bus is attached to socket 1, not sure why. This may explain why you have PCI address 07:00.0 and the NIC is on socket 1 or NUMA 1.

Lakshyagour commented 1 year ago

I am very sorry for that confusion. I have 2 NICs on my machine and had given the address of the other NIC.

HostBridge
  PCIBridge
    PCI 81:00.0 (Ethernet)
      Net "ens259f0"
    PCI 81:00.1 (Ethernet)
      Net "ens259f1"

KeithWiles commented 1 year ago

ah, that makes more sense :-)

Lakshyagour commented 1 year ago

Thank you for all the comments; they were very helpful. I think the issue lies somewhere else, or maybe I did something wrong in the application code. For now I will work with loopback mode and revisit it later when I have more understanding of the CNDP internals.

Thanks again for the all the help 💌

KR Lakshya