F-Stack / f-stack

F-Stack is a high-performance user-space network development kit based on DPDK, the FreeBSD TCP/IP stack, and a coroutine API.
http://www.f-stack.org

Multithreading support on F-Stack #834

Open RaduNichita opened 3 months ago

RaduNichita commented 3 months ago

Hey, I recently came across the F-Stack project and I am interested in finding out more about the multithreading support for it.

Is this something you would consider supporting in the future if somebody comes with a PR for it, or is there a limitation in the current codebase that prevents the F-Stack API from running on multiple threads?

AcTarjan commented 3 months ago

I have the same requirement for multithreading support.

zhaozihanzzh commented 3 months ago

I think the limitation may come from the way the FreeBSD network stack was ported. Many FreeBSD "syscalls" are called with the same struct thread *td, and many return values are stored in the same location. For example, lib/ff_syscall_wrapper.c contains many statements of the form rc = curthread->td_retval[0].
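
To make the hazard concrete, here is a minimal sketch of the pattern, not the real code. `mock_thread`, `mock_sys_socket` and `mock_ff_socket` are made-up stand-ins for `struct thread`, a ported syscall and its wrapper. With a single shared context, two threads could overwrite each other's `td_retval[0]` between the call and the read-back. Giving each OS thread its own context (done here with C11 `_Thread_local`, which is only one possible direction and an assumption on my part) removes that particular race. It says nothing about the rest of the stack's shared state.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FreeBSD's struct thread; only the field
 * discussed above is modelled. */
struct mock_thread {
    long td_retval[2];
};

/* One context per OS thread instead of one shared global.
 * Assumption: this is an illustration, not how F-Stack is written. */
static _Thread_local struct mock_thread tls_td;

/* Mock "syscall": reports its result through td_retval[0]. */
static int mock_sys_socket(struct mock_thread *td, long fd)
{
    assert(td != NULL);
    td->td_retval[0] = fd;
    return 0;
}

/* Wrapper in the style described above: call, then read the result back. */
static long mock_ff_socket(long fd)
{
    if (mock_sys_socket(&tls_td, fd) != 0)
        return -1;
    return tls_td.td_retval[0];
}

/* Each worker checks that it always reads back its own value. */
static void *worker(void *arg)
{
    long want = (long)(intptr_t)arg;
    intptr_t ok = 1;
    for (int i = 0; i < 100000; i++) {
        if (mock_ff_socket(want) != want)
            ok = 0;
    }
    return (void *)ok;
}

/* Runs two workers concurrently; returns 1 if neither saw the other's
 * result, 0 if one did, -1 if a thread could not be started. */
static int run_two_threads(void)
{
    pthread_t a, b;
    void *ra = NULL, *rb = NULL;
    if (pthread_create(&a, NULL, worker, (void *)(intptr_t)3) != 0)
        return -1;
    if (pthread_create(&b, NULL, worker, (void *)(intptr_t)4) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, &ra);
    pthread_join(b, &rb);
    return ra != NULL && rb != NULL;
}
```

If `tls_td` were a plain global instead, the same two workers would race on `td_retval[0]`, which is the situation described above.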

RaduNichita commented 3 months ago

Do you think it would be possible to create a new td structure when a new thread is created?

In my PR #835, I followed the comment from #430: when a new thread is created, I copy the pointer to the parent structure.

freak82 commented 3 months ago

IMO, there are 2 (or 3, depending on the use case) main advantages of F-stack over standard Linux:

  1. You don't have kernel/user-space context switches and copying for the IO operations. Currently, this can be mitigated in Linux using io_uring, though.
  2. You don't have shared data structures for the network traffic processing. Linux uses a single network stack to handle the traffic coming from all queues, where each queue is usually handled on a separate CPU core. In my experience, this single network stack becomes a bottleneck in scenarios with a high volume of traffic and/or packets, because of its shared data structures.
  3. In scenarios with lots of traffic and/or packets, it's generally faster to poll the NIC for packets than to work with interrupts. However, lots of NIC drivers implement some version of polling in addition to the interrupt handling.

So, if you want to use F-stack from multiple threads, this will remove the second advantage from the above list. The other two advantages can be mitigated to some extent in a standard Linux application, IMO.

I'm writing this because we use the FreeBSD stack from multiple threads, but not in this way. We use a separate instance of the stack for each thread, i.e. each thread uses its own network stack and shares (almost) nothing with the other threads (there are some lock-free queues for communication between the threads). However, in our version the DPDK layer is decoupled as a separate module and not glued to the FreeBSD stack. This allows the application to have the DPDK layer with N worker threads, each worker thread having its own instance of the FreeBSD stack. It's just an alternative design which allows (almost) linear scaling of the application with the number of CPU cores.
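
A rough sketch of that share-nothing shape, under stated assumptions: the worker owns all of its state (a plain counter stands in for the per-thread stack instance), and the only cross-thread channel is a single-producer/single-consumer ring built on C11 atomics. The names `spsc_ring`, `worker` and `run_demo` are invented for illustration, and the real application's queues may look quite different.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
              "RING_SIZE must be a power of two");

/* Single-producer/single-consumer lock-free ring. */
struct spsc_ring {
    _Atomic size_t head;  /* next slot to write; stored only by the producer */
    _Atomic size_t tail;  /* next slot to read; stored only by the consumer */
    int slots[RING_SIZE];
};

static bool ring_push(struct spsc_ring *r, int msg)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;  /* full */
    r->slots[head & (RING_SIZE - 1)] = msg;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct spsc_ring *r, int *msg)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;  /* empty */
    *msg = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Everything except the inbox is touched by the owning thread only. */
struct worker {
    struct spsc_ring inbox;  /* the only cross-thread channel */
    long private_sum;        /* stand-in for the thread's own stack instance */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    int msg;
    for (;;) {
        if (!ring_pop(&w->inbox, &msg))
            continue;        /* busy-poll, as a polling worker would */
        if (msg < 0)
            break;           /* negative value = shutdown request */
        w->private_sum += msg;
    }
    return NULL;
}

/* Sends 1..n to one worker and returns what the worker accumulated,
 * or -1 if the thread could not be started. */
static long run_demo(int n)
{
    struct worker w = {0};
    pthread_t t;
    if (pthread_create(&t, NULL, worker_main, &w) != 0)
        return -1;
    for (int i = 1; i <= n; i++)
        while (!ring_push(&w.inbox, i))
            ;                /* ring full: retry */
    while (!ring_push(&w.inbox, -1))
        ;
    pthread_join(t, NULL);   /* join orders the final read below */
    return w.private_sum;
}
```

With N workers there would be one such `struct worker` per core, so no worker ever takes a lock on the packet path.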

RaduNichita commented 3 months ago

Thanks for the answer, @freak82. Do you use F-Stack in your project? If so, did you make the change you mentioned to have one FreeBSD stack per thread?

freak82 commented 3 months ago

Yes, we currently use the modified F-stack version in production, but not for a server. We use it for a transparent caching proxy. As I said, the version is modified so that we instantiate a separate network stack (i.e. a separate FreeBSD/F-stack network stack) per thread, and there is no sharing between the threads aside from some lock-free queues used for message passing between threads. The key part is the separation between DPDK and the FreeBSD stack, because DPDK must be a single instance for the whole application, while the FreeBSD stack is used as a library instantiated in each thread separately.

RaduNichita commented 3 months ago

@freak82, do you think you could post the patch that you applied to F-Stack to have one FreeBSD stack per thread, please? I think it would be a great contribution to the community.

freak82 commented 3 months ago

I need to ask my employer first. However, keep in mind that:

RaduNichita commented 3 months ago

Hey @freak82 .

Do you have any updates regarding posting the patches for the F-Stack / FreeBSD stack?

freak82 commented 3 months ago

I got the green light for this from my employer. Here are the patches that we applied to the FreeBSD stack. f-stack-patches.tar.gz

A few notes:

  1. In general, I highly doubt that the patches will be useful to anybody, because they tailor F-stack to our use case and, in addition, turn it into a library which expects a relatively large amount of glue code to be provided by the application.
  2. There are functions added to F-stack which are, again, specific to our use case.
  3. The library is built as a .so, and every thread in a given application is supposed to dlopen its own copy of the .so.
  4. The DPDK functionality is completely removed from F-stack and moved, as a layer, into the application which uses the .so objects.
  5. You are supposed to first initialize each stack via ff_init_netstack.
  6. Then you are supposed to initialize the interface via ff_veth_if_init.
  7. During runtime, you are supposed to call ff_on_timer_tick with the declared frequency.
  8. During runtime, you can inject packets from your application into the stack via ff_veth_process_packets.
  9. Every packet is supposed to be "allocated" via ff_mbuf_gethdr.
  10. If there are segments, they should be allocated and linked via ff_mbuf_get.
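
The call order in items 5 to 9 could be sketched roughly as below. This is only an illustration of the sequencing. Every signature in `netstack_ops` is a guess invented for this sketch (the real ones are in the patched headers), and the `stub_*` functions merely record the order in which they are called. Multi-segment packets (item 10) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entry points named in the list above. The signatures are hypothetical
 * and simplified; consult the patched headers for the real ones. */
struct netstack_ops {
    int   (*init_netstack)(void);
    int   (*veth_if_init)(void);
    void  (*on_timer_tick)(void);
    void *(*mbuf_gethdr)(void *data, size_t len);
    void  (*veth_process_packets)(void **pkts, size_t count);
};

/* One worker's life cycle: init the stack, init the interface, then loop. */
static int run_worker(const struct netstack_ops *ops, int iterations)
{
    assert(ops != NULL);
    if (ops->init_netstack() != 0)                 /* item 5 */
        return -1;
    if (ops->veth_if_init() != 0)                  /* item 6 */
        return -1;
    for (int i = 0; i < iterations; i++) {
        /* Item 7: real code would pace this to the declared frequency. */
        ops->on_timer_tick();
        char frame[64] = {0};                      /* pretend received frame */
        void *pkt = ops->mbuf_gethdr(frame, sizeof frame);   /* item 9 */
        if (pkt != NULL)
            ops->veth_process_packets(&pkt, 1);    /* item 8 */
    }
    return 0;
}

/* Stubs that only record the call order, one letter per call. */
static char trace[64];

static void note(char c)
{
    size_t n = strlen(trace);
    if (n + 1 < sizeof trace) {
        trace[n] = c;
        trace[n + 1] = '\0';
    }
}

static int   stub_init_netstack(void) { note('I'); return 0; }
static int   stub_veth_if_init(void)  { note('V'); return 0; }
static void  stub_on_timer_tick(void) { note('T'); }
static void *stub_mbuf_gethdr(void *data, size_t len)
{
    (void)len;
    note('H');
    return data;
}
static void  stub_veth_process_packets(void **pkts, size_t count)
{
    (void)pkts;
    (void)count;
    note('P');
}

static const struct netstack_ops stub_ops = {
    stub_init_netstack, stub_veth_if_init, stub_on_timer_tick,
    stub_mbuf_gethdr, stub_veth_process_packets,
};
```

In a real application the function pointers would presumably be filled in per thread from that thread's own copy of the library, rather than from stubs.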

RaduNichita commented 2 months ago

Hey @freak82,

Firstly, many thanks for publishing the patch about multithreading on F-Stack. I really appreciate this! I took some time to go through it to understand the changes.

Secondly, I tried applying the first patch (git apply x-stack.patch), but I got the following error (I tried on the dev branch and on the v1.21 and v1.23 releases):

x-stack.patch:69: trailing whitespace.
    DEFAULT(pu->pru_setup_transparent_sockets, 
x-stack.patch:127: trailing whitespace.
static int x3me_socket(struct thread* td, 
x-stack.patch:128: trailing whitespace.
                       int domain, 
x-stack.patch:129: trailing whitespace.
                       int type, 
x-stack.patch:168: trailing whitespace.
    /* 
warning: build.sh has type 100644, expected 100755
error: patch failed: build.sh:5
error: build.sh: patch does not apply
error: patch failed: lib/ff_types.h:209
error: lib/ff_types.h: patch does not apply

I also tried running git apply --reject x-stack.patch and tried compiling the F-Stack library, but got the following error:

make: *** No rule to make target 'ff_types.c', needed by 'ff_types.o'.  Stop.

It seems that the ff_types.c file is missing from the series of three patches.

Do you think it would be possible to add it, please?

Furthermore, if I understood correctly, DPDK is initialized as a separate entity, not in the ff_init_netstack function. Could you add a small example which shows how the binding between F-Stack and DPDK is done, please?

Many thanks again! :pray:

freak82 commented 2 months ago

I will not have time for the DPDK examples these days. Too much work, sorry. Here are the missing ff_types files. I don't have time now to recreate the patch correctly. Sorry, again. ff_types.tar.gz

You may ping me again at the end of this week - like Friday for the example code.

RaduNichita commented 2 months ago

Hey @freak82,

Do you think you can add the example code to show how the binding between F-Stack and DPDK can be done, please?

freak82 commented 2 months ago

At the end of the week probably and I'm not sure I'll have time even then but I'll try.

freak82 commented 2 months ago

source-code.tar.gz Here is some source code from our application. A few notes about it:

  1. It's not a working example. I just took the excerpts of our application's code which operate with the F-stack library. The logic is intertwined with too many other things, and I can't give you working examples without sitting down for a day or two and doing just that. I don't have this time currently, sorry.
  2. The lib.h and lib.cpp files contain the logic for loading the library on a given thread and the wrappers for the dlopened functions. Note that you need to call init_on_this_thread on every worker thread, each with a separate copy of the f-stack library. You need to do this first, before doing anything else with the wrapped functions.
  3. The pkt_processor.cpp file contains excerpts of the functionality which calls the wrapped f-stack functions. There are a few initialization routines, init_.... There is also the logic from the main application loop, with the receiving and sending of packets. The initialization routines use some custom allocations, but you'll need to figure out that logic by yourself. The custom allocations are used in our application just because every heap allocation is done via custom memory allocators which work with huge pages underneath. You may throw all that away and use the DPDK rte_malloc and similar functions. Or, for the initial tests, you can use the glibc malloc and friends.
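
For readers wondering what "a separate copy of the library per thread" can look like in practice, here is one possible sketch, which is not taken from lib.cpp. Calling dlopen() twice on the same file normally returns the same handle, so this version first copies the .so to a per-worker path and loads that copy. The helper names and the /tmp path are assumptions for illustration, and the predictable temporary name is not safe for production use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

/* Copy src to dst byte for byte; returns 0 on success, -1 on failure. */
static int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return -1;
    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    char buf[65536];
    size_t n;
    int rc = 0;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;
    fclose(in);
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}

/* Load a private copy of lib_path for one worker. Because the copy is a
 * different file, the dynamic loader maps it separately, so the copy has
 * its own globals. Returns the dlopen handle, or NULL on any failure. */
static void *load_private_copy(const char *lib_path, int worker_id)
{
    assert(lib_path != NULL);
    char copy_path[256];
    int len = snprintf(copy_path, sizeof copy_path, "/tmp/netstack-%ld-%d.so",
                       (long)getpid(), worker_id);
    if (len < 0 || (size_t)len >= sizeof copy_path)
        return NULL;
    if (copy_file(lib_path, copy_path) != 0) {
        unlink(copy_path);   /* remove a partial copy, if any */
        return NULL;
    }
    void *handle = dlopen(copy_path, RTLD_NOW | RTLD_LOCAL);
    unlink(copy_path);       /* the mapping stays valid after unlink */
    return handle;
}
```

Each worker would then resolve its entry points from its own handle with dlsym(), for example dlsym(handle, "ff_init_netstack"), before calling anything else. On older glibc versions the program may need to be linked with -ldl.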

wangchong2023 commented 2 months ago

@freak82 @RaduNichita Have you studied libuinet? It supports uinet_instance_t and libev, but I wonder whether it supports a multi-threading model.