OpenFastPath / ofp

OpenFastPath project
BSD 3-Clause "New" or "Revised" License

ofp_send / ofp_sendto poor performance #252

Open · gabeblack opened this issue 4 years ago

gabeblack commented 4 years ago

With OFP sitting on top of odp_dpdk, the performance of ofp_send/ofp_sendto is pretty poor (UDP). In a while loop running nothing but ofp_send, the rate caps out at about 110 Kpps.

The while loop runs in its own thread. That thread was not spawned with the ODP thread API, but it did run the ODP/OFP local thread init so it can use the OFP fastpath APIs. The ODP/OFP setup has two dispatch threads running on their own cores.
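For reference, the loop is essentially the sketch below (simplified: the destination address, port and payload are placeholders, error handling is omitted, it assumes OFP's BSD-style socket API from ofp.h, and the per-thread ODP/OFP init has already run in this thread):

    #include <string.h>
    #include <arpa/inet.h>
    #include <ofp.h>

    /* Simplified UDP send loop over the OFP socket API. */
    static void send_loop(void)
    {
        struct ofp_sockaddr_in dst;
        char payload[64];                        /* placeholder payload */
        int fd;

        memset(payload, 0, sizeof(payload));
        memset(&dst, 0, sizeof(dst));
        dst.sin_len = sizeof(dst);               /* OFP uses BSD-style sockaddr */
        dst.sin_family = OFP_AF_INET;
        dst.sin_port = htons(5001);              /* placeholder destination port */
        dst.sin_addr.s_addr = htonl(0xc0a8c814); /* placeholder 192.168.200.20 */

        fd = ofp_socket(OFP_AF_INET, OFP_SOCK_DGRAM, OFP_IPPROTO_UDP);
        if (fd < 0)
            return;

        for (;;)
            (void)ofp_sendto(fd, payload, sizeof(payload), 0,
                             (struct ofp_sockaddr *)&dst, sizeof(dst));
    }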

Is the ofp_send/ofp_sendto family of APIs not meant to be part of the fast path, i.e. is it on the slow path? Are the pktio interfaces the only ones expected to be fast? Just curious what I might be doing wrong.

Using vanilla DPDK on the same NIC, 4-5 Mpps is achievable with little effort or tuning.

Why not use plain DPDK? I was hoping to make use of the OFP networking stack's capabilities: I'd rather not have to populate layer 2/3/4 headers, perform ARP resolution, etc.

bogdanPricope commented 4 years ago

Hi,

  1. I don't have exact numbers, but that looks low (maybe Matias has some results).
  2. As you suspect, the socket API is not accelerated: simply put, it requires at least one memcpy() and lots of locks to send or receive data. Instead, other mechanisms were considered:
    • the 'zero-copy' mechanism and packet hooks are designed to give direct access to the received packet itself.
    • on the send side, ofp_udp_pkt_sendto() can probably be used.

But the main use case for OFP is packet processing, not packet generation: receive a packet, process it, send it.

If a pure packet generator is your use case, then maybe odp_generator from odp_dpdk will be more useful.

  3. The design of your application has a lot of impact: there are control threads and worker threads.

Worker threads can receive packets (access to the interface RX queues) and send packets (access to the TX queues). Control threads can receive data through the socket API and send packets (access to the TX queues).

Now, when you are running ofp_send/ofp_sendto in a loop in a control thread, the traffic is not using the workers: it is just one thread (yours) using one TX queue. If your thread happens to run on the same core as a worker thread (which tries to use 100% of the core), you will get poor performance. So a little bit of thread-per-core planning is needed (see the sketch after this list).

  4. OFP configuration and network card capabilities are important:
    • using scheduled workers (fpm example) vs. direct mode (udp_fwd_socket example) + 1 scheduled worker
    • number of RX and TX queues available: if threads have to share queues then locks are needed.
    • specific OFP configuration flags (like burst size), etc.
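To illustrate the thread-design point, here is a minimal sketch of a dedicated send thread: pinned to its own core and running the per-thread ODP/OFP init before using the socket API. The core number and send_loop() are placeholders, and the exact init calls can differ between ODP/OFP versions:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <odp_api.h>
    #include <ofp.h>

    void send_loop(void);   /* the application's ofp_send()/ofp_sendto() loop */

    /* Per-thread entry: local init first, then the send loop. */
    static void *send_thread(void *arg)
    {
        odp_instance_t instance = *(odp_instance_t *)arg;

        if (odp_init_local(instance, ODP_THREAD_CONTROL) || ofp_init_local())
            return NULL;

        send_loop();
        return NULL;
    }

    /* Start the send thread pinned to 'core', away from the worker cores. */
    static int start_send_thread(odp_instance_t *instance, int core)
    {
        pthread_t tid;
        pthread_attr_t attr;
        cpu_set_t cpus;

        CPU_ZERO(&cpus);
        CPU_SET(core, &cpus);

        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
        return pthread_create(&tid, &attr, send_thread, instance);
    }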

Usually, when planning to use OFP, you have to start by filling in a list of these details.

So, tell us more ...

Merci, Bogdan

gabeblack commented 4 years ago

Hi Bogdan,

Thank you for the detailed response. I was able to improve the performance 5x by changing the NIC (a Mellanox card) to not use the igb_uio driver, since DPDK has direct support for that NIC via ibverbs and the mlx poll mode driver. However, performance is still well under what we were hoping to achieve.

Anyway, I definitely ensured the send thread was not using the same cores as the worker threads; they were bound to different cores. Initially the control thread was on CPU 0 (I think the default core for control threads), but I moved it to CPU 1, since a lot of Linux processes use CPU 0 for handling interrupts and other things. That turned out to be negligible, because it didn't matter whether I ran on CPU 1 or 0. I put the worker threads on 2 and 3, but it doesn't seem to matter which cores they run on (the system I was testing on has 16 cores). Since I was only doing transmit, I wasn't sure how useful it would be to have more than one worker thread.
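For completeness, that core placement just comes down to cpumasks like the sketch below (core numbers as described above; the thread-create helper that consumes the masks depends on the ODP/OFP version in use):

    #include <odp_api.h>

    /* Sketch: cpumasks matching the placement described above. */
    static void build_cpumasks(odp_cpumask_t *control, odp_cpumask_t *workers)
    {
        odp_cpumask_zero(control);
        odp_cpumask_set(control, 1);      /* send/control thread on core 1 */

        odp_cpumask_zero(workers);
        odp_cpumask_set(workers, 2);      /* worker threads on cores 2 and 3 */
        odp_cpumask_set(workers, 3);
        /* ...pass these masks to whatever thread-create helper is used... */
    }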

I think I understand the purpose of the hook, but I was hoping to avoid that, since it seems to be a sort of global hook for the port. Meaning, I could have several UDP sockets open sending to different destinations, but all flows would go through the same hook, so I wouldn't really know which packet belonged to which flow without doing some packet parsing... It seems like there is added complexity there.

Anyway, if you have any benchmarks, especially with the socket API, they would be very useful for knowing what is possible or what one might expect to achieve.

bogdanPricope commented 4 years ago

Hi,

You can try this:

  1. Put the worker thread on core 0: since you are doing mostly send operations, it will not be used much (timers, etc. will run on it).
  2. Cores 1-N: create multiple control threads (one per core) running your send loop (a different socket per thread, bound, etc.).
  3. For your configuration, it will probably make sense to increase OFP_PKT_TX_BURST_SIZE (include/api/ofp_config.h) to 16 or 32 (see the snippet after this list).
  4. Hyperthreading (x86): use only one virtual core per physical core, otherwise you will see no increase in performance.
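For point 3, the burst size is a compile-time define; roughly like this (the exact default value and surrounding defines may differ between OFP versions):

    /* include/api/ofp_config.h */
    /* Number of packets handed to the NIC in one burst on the TX path;
     * larger bursts amortize the per-call overhead when generating traffic. */
    #define OFP_PKT_TX_BURST_SIZE 16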

Hooks are points where you can access a packet as it is processed by OFP: you can inspect the packet, or take ownership of it and do what you want with it. Alternatively, you can use the "zero-copy" API, which is basically a hook per socket: see example/udp_fwd_socket/udp_fwd_socket.c line 63.

There are many possible optimizations (e.g. using multiple TX queues, one per used core, without the multithread-safe option), but the points above should already improve performance. Maybe I'll try this scenario myself on my setup.

Btw, if the DUT and the packet destination are in the same network, you can add a direct route, e.g. in the CLI:

route add 192.168.200.20/32 gw 192.168.200.20 dev fp1

Merci, Bogdan

bogdanPricope commented 4 years ago

So I made a setup as described above: I changed SHM_PKT_POOL_NB_PKTS to 102400, and I added a route and a static ARP entry.

With OFP_PKT_TX_BURST_SIZE == 1 I am getting:
    • 1 send loop: 1.763 Mpps
    • 2 send loops: 2.391 Mpps
    • 3 send loops: 2.583 Mpps

With OFP_PKT_TX_BURST_SIZE == 16 I am getting:
    • 1 send loop: 1.827 Mpps
    • 2 send loops: 2.850 Mpps
    • 3 send loops: 4.354 Mpps

And this is with the regular socket API (ofp_sendto())... and without multiple TX queues, etc. I am using a couple of 82599ES NICs connected through DAC, on a setup with two i5 machines (Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz).

iufl commented 4 years ago

Hi,

My first idea would be for you to test the udpecho and udp_fwd_socket examples, to check whether the numbers are still low. We tested these before and had better performance than what you reported.

You can also set OFP_PKT_TX_BURST_SIZE to a higher value, such as 16, in the case of line-rate traffic, and see if the numbers improve.

BR, /Iulia