F-Stack Send Zero Copy Introduction

Original: F-Stack (FStack), 2022-04-17 21:16

Packet processing on a server runs in two directions: receiving and sending. Our own business scenarios receive very little data, so the receiving direction will be introduced later.

This article introduces the current zero-copy scheme for F-Stack's sending direction, its effect, and how to choose application scenarios for it. Data copying in the sending direction happens in two stages: one copies protocol stack data to the DPDK rte_mbuf, and the other copies data from the application layer to the FreeBSD protocol stack when the application layer calls the socket sending interface. The two stages are introduced separately below.

Protocol stack to DPDK

The zero-copy implementation of this stage was merged into the F-Stack mainline through Pull Request #364, submitted by @jinhao2. See the code for the implementation details; only a brief introduction to the scheme is given here.

Introduction

When the process is initialized, memory of the specified size is allocated for the BSD stack through mmap (256 MB by default). The default can be changed through the memsz_MB parameter in config.ini.

The physical memory is pinned with mlock() so that the pages cannot be swapped out to the swap partition, which would change the mapping between virtual and physical addresses.

Calculate and save the starting address of each page, both virtual and physical. The physical address can be obtained through the relevant interface provided by DPDK.

Initialize a stack structure to manage all allocated pages.

Replace the implementation of ff_mmap()/ff_munmap(): when the BSD stack calls kmem_malloc()/kmem_free(), ff_mmap()/ff_munmap() is called to acquire or return a page of memory from this pool.

When the data address of a BSD protocol stack mbuf is assigned to a DPDK rte_mbuf, check whether the address lies in the memory pool requested at initialization, look up the corresponding physical address from the virtual address, and assign the two to buf_addr and buf_physaddr of the rte_mbuf structure respectively, without actually copying the memory.
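The pool described in the steps above can be sketched roughly as follows. This is a minimal illustration, not F-Stack's actual code: every name here (page_pool, page_pool_init, virt2phys_fn, fake_virt2phys) is hypothetical, and the physical address lookup is passed in as a callback so that the sketch builds without DPDK. In a DPDK build the callback would wrap a translation call such as rte_mem_virt2phy().

```c
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical callback: translate a virtual address to a physical one. */
typedef uint64_t (*virt2phys_fn)(const void *va);

struct page_pool {
    char     *base;     /* start of the mmap()ed region */
    size_t    page_sz;  /* system page size */
    size_t    npages;   /* number of pages in the region */
    uint64_t *phys;     /* phys[i] = physical start address of page i */
    void    **stack;    /* free pages, managed as a LIFO stack */
    size_t    top;      /* number of free pages currently on the stack */
    int       locked;   /* 1 if mlock() succeeded, 0 otherwise */
};

static int page_pool_init(struct page_pool *p, size_t npages, virt2phys_fn v2p)
{
    p->page_sz = (size_t)sysconf(_SC_PAGESIZE);
    p->npages = npages;
    p->base = mmap(NULL, npages * p->page_sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p->base == MAP_FAILED)
        return -1;
    /* Pin the pages so the virtual-to-physical mapping stays fixed.
     * A real zero-copy path must treat a failure here as fatal. */
    p->locked = (mlock(p->base, npages * p->page_sz) == 0);
    p->phys = malloc(npages * sizeof(*p->phys));
    p->stack = malloc(npages * sizeof(*p->stack));
    if (p->phys == NULL || p->stack == NULL) {
        free(p->phys);
        free(p->stack);
        munmap(p->base, npages * p->page_sz);
        return -1;
    }
    for (size_t i = 0; i < npages; i++) {
        void *va = p->base + i * p->page_sz;
        p->phys[i] = v2p(va);      /* save the physical start of each page */
        p->stack[i] = va;          /* and push the page onto the free stack */
    }
    p->top = npages;
    return 0;
}

/* Pop one free page, or NULL if the pool is exhausted. */
static void *page_pool_get(struct page_pool *p)
{
    return p->top > 0 ? p->stack[--p->top] : NULL;
}

/* Push a page back onto the free stack. */
static void page_pool_put(struct page_pool *p, void *page)
{
    assert(p->top < p->npages);
    p->stack[p->top++] = page;
}

/* Return 1 and fill *pa if va lies inside the pool; return 0 otherwise,
 * in which case the caller has to fall back to copying. */
static int page_pool_virt2phys(const struct page_pool *p, const void *va,
                               uint64_t *pa)
{
    uintptr_t a = (uintptr_t)va, lo = (uintptr_t)p->base;
    if (a < lo || a >= lo + p->npages * p->page_sz)
        return 0;
    size_t off = (size_t)(a - lo);
    *pa = p->phys[off / p->page_sz] + off % p->page_sz;
    return 1;
}

/* Stand-in translator for experiments: pretends physical == virtual. */
static uint64_t fake_virt2phys(const void *va)
{
    return (uint64_t)(uintptr_t)va;
}
```

A caller would run page_pool_init() once at startup, serve page requests with page_pool_get()/page_pool_put(), and call page_pool_virt2phys() on the send path to decide between attaching the address and copying.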

Use a circular queue to hold the pointers of sent mbufs. The length of the queue should equal the NIC's tx_queue_length. By the time an item in the queue is overwritten with a new value, the old mbuf must already have been processed by the NIC and can be safely released.

If the mbuf is of type ext_cluster, meaning it contains an rte_mbuf whose data address was attached zero-copy when the packet was received, use rte_pktmbuf_clone() instead.
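The circular queue of sent mbuf pointers mentioned above could look something like the sketch below. The names (sent_ring, sent_ring_push, release_fn, count_release) are hypothetical and the entries are plain void pointers; in F-Stack they would be BSD mbuf pointers and the release callback would free the mbuf. The sketch assumes, as the text does, that the ring length equals the NIC TX queue length, so that a slot is only overwritten after the NIC has consumed its previous occupant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback that frees a buffer the NIC is done with. */
typedef void (*release_fn)(void *buf);

struct sent_ring {
    void     **slot;     /* one slot per NIC TX descriptor */
    size_t     len;      /* assumed equal to the NIC TX queue length */
    size_t     head;     /* index of the next slot to overwrite */
    release_fn release;
};

static int sent_ring_init(struct sent_ring *r, size_t txq_len, release_fn rel)
{
    if (txq_len == 0)
        return -1;
    r->slot = calloc(txq_len, sizeof(*r->slot));
    if (r->slot == NULL)
        return -1;
    r->len = txq_len;
    r->head = 0;
    r->release = rel;
    return 0;
}

/*
 * Record buf as in flight. The slot being overwritten was filled exactly
 * len sends ago, so under the stated assumption the NIC has finished with
 * it and the old buffer can be released before the slot is reused.
 */
static void sent_ring_push(struct sent_ring *r, void *buf)
{
    void *old = r->slot[r->head];
    if (old != NULL)
        r->release(old);
    r->slot[r->head] = buf;
    r->head = (r->head + 1) % r->len;
}

/* Demo release callback: only counts how many buffers were released. */
static int released_count;
static void count_release(void *buf)
{
    (void)buf;
    released_count++;
}
```

With a ring of length 4, the first four pushes release nothing; the fifth push releases the buffer stored by the first one.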

Usage and precautions

How to use

This function is not enabled by default. It takes effect after turning on the FF_USE_PAGE_ARRAY compile option in lib/Makefile and recompiling the F-Stack lib and the application.

Otherwise, application programming and usage are no different from the conventional copy mode; the feature is transparent to the application layer.

Precautions

The memory pool is requested by the process through mmap and mlock at initialization. It belongs to the private address space of the process, and the memory cannot be passed to other processes for use.

You can consider mapping huge-page memory or using shared memory during initialization (which also requires SHM_LOCK or mlock to lock the memory and prevent swapping) to allow cross-process use, but the address storage and lookup structures would then also need to change. In general, using the pool within a single process is sufficient, and modifying it for cross-process use is not recommended.

The zero-copy function from the protocol stack to DPDK (FF_USE_PAGE_ARRAY) can be enabled and used alone, or together with the zero-copy sending interface (FF_ZC_SEND).

Whether the memory copy removed here improves application performance still has to be tested with the specific application. When packets are of a suitable size and the feature is used appropriately, there is some performance gain, but it is not necessarily significant; for example, the improvement may be only about 2-3%.

Application layer to protocol stack

By providing a separate zero-copy API, the application layer can avoid copying data from the application layer to the BSD protocol stack when sending data through the socket interface. For details, see commit e12886c; a more specific introduction is given below.

Introduction

A separate zero-copy mbuf structure, ff_zc_mbuf, is provided as the application-layer cache structure. The application layer should use it for subsequent data operations and transmissions. The type is defined as follows:

    struct ff_zc_mbuf {
        void *bsd_mbuf;      /* Points to the head node of the BSD mbuf chain */
        void *bsd_mbuf_off;  /* Points to the current node in the BSD mbuf chain, after offset off */
        int off;             /* Offset in the mbuf chain; the application layer should not modify it directly */
        int len;             /* Total length of the requested mbuf chain cache; less than or equal to the data length the chain can actually carry */
    };

The ff_zc_mbuf_get() interface lets the application request in advance an mbuf chain that the kernel can use directly, to serve as the application-layer data cache. The interface declaration is as follows.

    int ff_zc_mbuf_get(struct ff_zc_mbuf *m, int len);

The interface takes a struct ff_zc_mbuf * pointer and the total length of the cache to be requested. Internally it allocates an mbuf chain through m_getm2() and stores the head address in the bsd_mbuf member of the ff_zc_mbuf structure, which can later be passed to the ff_write() interface.

m_getm2() is the same interface the standard socket path uses to allocate an mbuf chain when copying application-layer data into the protocol stack, so an mbuf chain returned by it is fully compatible as the application-layer cache when sending data.

A cache data write function, ff_zc_mbuf_write(), is provided. The function declaration is as follows:

    int ff_zc_mbuf_write(struct ff_zc_mbuf *m, char *data, int len);

When the application layer stores data to be sent, it should write the data directly into the cache of the mbuf chain pointed to by the ff_zc_mbuf through the ff_zc_mbuf_write() interface. ff_zc_mbuf_write() can be called multiple times to write cached data, and the offset within the cache is handled automatically inside the interface. The total written length cannot exceed the cache length initially requested.
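To make the offset handling concrete, here is a small self-contained model of a get/write pair over a chain of fixed-size segments. It is only a sketch of the idea, not F-Stack's implementation: demo_seg stands in for a BSD mbuf, demo_zc_buf mirrors the four fields of ff_zc_mbuf, and all demo_ names as well as the tiny 8-byte segment size are assumptions chosen so that writes visibly spill from one segment into the next.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_SEG_SIZE 8             /* tiny on purpose, to force chaining */

struct demo_seg {                   /* stand-in for one BSD mbuf */
    struct demo_seg *next;
    int              len;           /* bytes used in data[] */
    char             data[DEMO_SEG_SIZE];
};

struct demo_zc_buf {                /* mirrors the fields of ff_zc_mbuf */
    struct demo_seg *head;          /* plays the role of bsd_mbuf */
    struct demo_seg *cur;           /* plays the role of bsd_mbuf_off */
    int              off;           /* bytes written so far */
    int              len;           /* total bytes requested */
};

static void demo_zc_free(struct demo_seg *s)
{
    while (s != NULL) {
        struct demo_seg *n = s->next;
        free(s);
        s = n;
    }
}

/* Allocate enough segments to hold len bytes. */
static int demo_zc_get(struct demo_zc_buf *b, int len)
{
    struct demo_seg **tail = &b->head;

    if (len <= 0)
        return -1;
    b->head = NULL;
    for (int cap = 0; cap < len; cap += DEMO_SEG_SIZE) {
        *tail = calloc(1, sizeof(**tail));
        if (*tail == NULL) {
            demo_zc_free(b->head);
            b->head = NULL;
            return -1;
        }
        tail = &(*tail)->next;
    }
    b->cur = b->head;
    b->off = 0;
    b->len = len;
    return 0;
}

/* Append data at the current offset, moving on to later segments as each
 * one fills up. Returns len, or -1 if the request would be exceeded. */
static int demo_zc_write(struct demo_zc_buf *b, const char *data, int len)
{
    int done = 0;

    if (len < 0 || b->off + len > b->len)
        return -1;
    while (done < len) {
        int room = DEMO_SEG_SIZE - b->cur->len;
        if (room == 0) {            /* current segment full: advance */
            b->cur = b->cur->next;
            continue;
        }
        int n = (len - done < room) ? len - done : room;
        memcpy(b->cur->data + b->cur->len, data + done, (size_t)n);
        b->cur->len += n;
        done += n;
    }
    b->off += len;
    return len;
}
```

Requesting 20 bytes and then writing "hello " followed by "zero-copy" fills the first segment with "hello ze" and continues with "ro-copy" in the second; a further write that would pass the requested 20 bytes is rejected.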

When the application calls the ff_write() interface, it passes ff_zc_mbuf.bsd_mbuf as the buf parameter. For example:

    ff_write(clientfd, zc_buf.bsd_mbuf, buf_len);

In the m_uiotombuf() function, the head address of the passed mbuf chain is used directly, without allocating an additional mbuf chain or copying data, as shown below:

    #ifdef FSTACK_ZC_SEND
        if (uio->uio_segflg == UIO_SYSSPACE && uio->uio_rw == UIO_WRITE) {
            m = (struct mbuf *)uio->uio_iov->iov_base; /* Use the mbuf chain head address of the application layer directly */
            uio->uio_iov->iov_base = (char *)(uio->uio_iov->iov_base) + total;
            uio->uio_iov->iov_len = 0;
            uio->uio_resid = 0;
            uio->uio_offset = total;
            progress = total;
        } else {
    #endif
        m = m_getm2(NULL, max(total + align, 1), how, MT_DATA, flags); /* Copy mode: allocate the mbuf chain */
        if (m == NULL)
            return (NULL);
        m->m_data += align;

        /* Fill all mbufs with uio data and update header information. */
        for (mb = m; mb != NULL; mb = mb->m_next) {
            length = min(M_TRAILINGSPACE(mb), total - progress);

            error = uiomove(mtod(mb, void *), length, uio); /* Copy mode: copy the application layer data to the protocol stack */
            if (error) {
                m_freem(m);
                return (NULL);
            }

            mb->m_len = length;
            progress += length;
            if (flags & M_PKTHDR)
                m->m_pkthdr.len += length;
        }
    #ifdef FSTACK_ZC_SEND
        }
    #endif

After ff_write() returns successfully, the mbuf chain inside the previously requested ff_zc_mbuf structure does not need to be released, and the structure can be reused by calling ff_zc_mbuf_get() to allocate a new BSD mbuf chain.

ff_zc_mbuf_write() cannot be called on the structure again directly; ff_zc_mbuf_get() must first be called to allocate a new mbuf chain before the structure can be used again.

The usage of the zero-copy sending interface also differs from the standard socket interface. For details, refer to the introduction above and the sample code.
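The order of calls described above (get, one or more writes, then send, with a fresh get before every send) can be sketched as below. So that the flow compiles without F-Stack, the three calls go through a hypothetical zc_ops table, and zc_buf simply repeats the fields of ff_zc_mbuf. In a real build the table entries would presumably map onto ff_zc_mbuf_get(), ff_zc_mbuf_write() and ff_write(). The fake_ functions are stand-ins for experimentation only, and error cleanup for a partially filled chain is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Same fields as the ff_zc_mbuf structure shown in the article. */
struct zc_buf { void *bsd_mbuf; void *bsd_mbuf_off; int off; int len; };

/* Hypothetical indirection over the three calls used by the send path. */
struct zc_ops {
    int     (*get)(struct zc_buf *m, int len);
    int     (*write)(struct zc_buf *m, const char *data, int len);
    ssize_t (*send)(int fd, const void *buf, size_t nbytes);
};

/* Send header + body as one zero-copy write. Returns bytes sent or -1. */
static ssize_t zc_send_two_parts(const struct zc_ops *ops, int fd,
                                 const char *hdr, int hdr_len,
                                 const char *body, int body_len)
{
    struct zc_buf m;
    int total = hdr_len + body_len;

    if (ops->get(&m, total) != 0)       /* a fresh chain for every send */
        return -1;
    if (ops->write(&m, hdr, hdr_len) < 0 ||
        ops->write(&m, body, body_len) < 0)
        return -1;                      /* cleanup omitted in this sketch */
    /* Pass the chain head, not the payload bytes, as the buffer. */
    return ops->send(fd, m.bsd_mbuf, (size_t)total);
}

/* ---- stand-in backend, for trying the flow without F-Stack ---- */
static char   fake_store[64];
static int    fake_gets;
static size_t fake_sent;

static int fake_get(struct zc_buf *m, int len)
{
    if (len <= 0 || len > (int)sizeof(fake_store))
        return -1;
    fake_gets++;
    m->bsd_mbuf = m->bsd_mbuf_off = fake_store;
    m->off = 0;
    m->len = len;
    return 0;
}

static int fake_write(struct zc_buf *m, const char *data, int len)
{
    if (len < 0 || m->off + len > m->len)
        return -1;
    memcpy(fake_store + m->off, data, (size_t)len);
    m->off += len;
    return len;
}

static ssize_t fake_send(int fd, const void *buf, size_t nbytes)
{
    (void)fd;
    if (buf != fake_store)              /* must receive the chain head */
        return -1;
    fake_sent += nbytes;
    return (ssize_t)nbytes;
}
```

Calling zc_send_two_parts() twice performs two separate get calls, which matches the rule that a structure must be given a new chain before it is written to again.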

Precautions

Using the zero-copy sending interface requires modifying the original application, and it does not necessarily bring an obvious performance improvement, so it is not enabled by default.

The zero-copy sending interface (FF_ZC_SEND) can be enabled and used alone, or together with FF_USE_PAGE_ARRAY.

As with the zero copy from the protocol stack to DPDK, whether the memory copy removed here improves application performance still has to be tested with the specific application. In suitable scenarios there is some performance gain, but it is not necessarily significant; for example, the improvement may be only about 2-3%.

The article has been modified on 2022-04-17
