Atoptool / atop

System and process monitor for Linux
GNU General Public License v2.0

Netatop bpf #220

Closed: liutingjieni closed this issue 1 year ago

liutingjieni commented 1 year ago

Internally at ByteDance, we rely on the per-process network indicators to debug problems. But when collecting data with the community version of netatop (built from netatop.tar), I found that performance problems exist, especially in scenarios with large numbers of processes and heavy network traffic. So I started to try to optimize netatop:

As we all know, this functionality is divided into two parts. One part is responsible for collection, such as netatop.ko or the BPF program; the BPF code is at https://github.com/Atoptool/atop/commit/57d49aafd6be5e84304f8b7c999fae40a3f867a1. The other part is responsible for processing and displaying the indicators; I made related changes to this part as well, see https://github.com/Atoptool/atop/commit/86fd15698c7e970ff9d984517369c059bb0b5f75.

I have tried the (ip_rcv/ip_output, ip_local_deliver/ip_queue_xmit) hook points in the network layer, but there a received packet is still being handled in a soft interrupt, and there is no way to get the pid of the owning process. Using these hook points would still require a data structure similar to SBUCKS/TBUCKS, and I think the performance of that approach is relatively low. So I used hook points internal to ByteDance, which are still on the road to open source.

The above is my design and implementation for netatop. Could you help to review these patches and give some suggestions? Thanks a lot!

Atoptool commented 1 year ago

As stated in the comments of pull request #208, I am busy finalising a BPF module myself which uses a similar approach as the original netatop module: stats of received packets are remembered without assignment to a pid/tid until packets are transmitted in the context of the process/thread and assignment to a pid/tid is possible. I have tried several kernel hooks myself to determine which one is best for capturing the transmitted and received packets, but could not find the proper hook, particularly for transmitted packets. I used net_dev_start_xmit, which is usually called in the context of the sending process, but not if packets get queued; then packets arrive in the context of kworker or ksoftirqd, which is the process that later traffic will be assigned to as well (which is obviously misleading).
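
A minimal sketch of that deferred-assignment idea (map layout and names are illustrative assumptions, not the actual module code): the receive path parks byte counts per socket, and the transmit path, which runs in process context, claims them for the current pid/tid.

    /* Sketch only: receive counts are parked per socket while the context
     * is a softirq, and attributed to the current pid/tid as soon as the
     * same socket transmits in process context. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_LRU_HASH);
        __uint(max_entries, 65536);
        __type(key, __u64);     /* socket identity, e.g. the sk pointer */
        __type(value, __u64);   /* rx bytes not yet owned by a pid */
    } pending_rx SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u32);     /* tgid */
        __type(value, __u64);   /* rx bytes attributed so far */
    } rx_per_task SEC(".maps");

    /* called from the transmit hook, which runs in process context */
    static __always_inline void claim_pending(__u64 sk_id)
    {
        __u32 tgid = bpf_get_current_pid_tgid() >> 32;
        __u64 *pend = bpf_map_lookup_elem(&pending_rx, &sk_id);

        if (pend && *pend) {
            __u64 *owned = bpf_map_lookup_elem(&rx_per_task, &tgid);

            if (owned)
                __sync_fetch_and_add(owned, *pend);
            else
                bpf_map_update_elem(&rx_per_task, &tgid, pend, BPF_ANY);
            *pend = 0;
        }
    }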

I tried to compile and run your code, but I get a compilation error:

kprobe.bpf.c:79:14: error: redefinition of 'bytedance_function'
int BPF_PROG(bytedance_function, struct sock *sk, int length, int error, int flags)
             ^
kprobe.bpf.c:42:14: note: previous definition is here
int BPF_PROG(bytedance_function, struct sock *sk, int length, int error, int flags)

I can see that you use UNIX domain sockets to transfer the stats from the BPF server to atop. This is also what I use in my own implementation. However, in your implementation several requests are issued per atop sample, while in my implementation only one transfer is issued to get the stats for all processes. When we use a uniform interface to communicate from atop (as a client) to BPF (as a server), several BPF implementations could coexist in the future.
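
For illustration, a hypothetical client-side sketch of that "one transfer per sample" pattern; the socket path, request string and record layout are assumptions, not the real protocol:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    #define NETATOP_SOCKET "/run/netatop-bpf.sock"      /* assumed path */

    struct netpertask_sample {                  /* illustrative layout */
        unsigned int       tgid;
        unsigned long long sndbytes, rcvbytes;
    };

    int fetch_all_stats(void)
    {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        struct netpertask_sample npt;
        int fd;

        strncpy(sa.sun_path, NETATOP_SOCKET, sizeof sa.sun_path - 1);

        if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
            return -1;

        if (connect(fd, (struct sockaddr *)&sa, sizeof sa) == -1) {
            close(fd);                  /* netatop-bpf server not running */
            return -1;
        }

        write(fd, "GETALL", 6);         /* single request per atop sample */

        /* read the stats of all processes in one stream until EOF */
        while (read(fd, &npt, sizeof npt) == (ssize_t)sizeof npt)
            printf("tgid %u: sent %llu rcvd %llu\n",
                   npt.tgid, npt.sndbytes, npt.rcvbytes);

        close(fd);
        return 0;
    }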

However the BPF implementation (also my own) will not be part of the atop repo so I will not merge this pull request.

liutingjieni commented 1 year ago

The reason for the compilation error is that the hook point I am currently using is not open source; I substituted a placeholder string for the hook point name, so the code does not compile. Next I will try to push the hook point to the community as open source. The hook point I use is in the context of the process's syscall, so the statistics are from the perspective of the process. What is your opinion on the significance of such statistics?

On the choice of hook points, I think you can try ip_rcv/ip_output and ip_local_deliver/ip_queue_xmit.

I chose to send one request per task to the BPF side for compatibility with the original atop program, but your approach will be more efficient. And I am very much in favor of using a uniform interface to communicate from atop (as a client) to BPF (as a server).

The code for BPF is not intended to be part of the atop repo, but to live in a separate repo; it is included here just for the convenience of reviewing the pull request.

liutingjieni commented 1 year ago

I'm now trying to push our internal hook points to the community and have made good progress.

liutingjieni commented 1 year ago

I have created a new repository for the netatop-bpf code, which can be found at netatop-bpf. Since it uses a tracepoint that was recently submitted to the Linux community, it requires a kernel version >= 6.3 to compile and run successfully.

I have also modified the atop code according to your feedback, so that only one transfer is issued to get the statistics of all processes.

Hoping to get your code review!

liutingjieni commented 1 year ago

After consulting the kernel community, the kernel function sock_recvmsg_nosec is more suitable as our hook point for per-process bandwidth monitoring. And because the tracepoint method performs better than the kprobe method, we added a static tracepoint to the kernel function sock_recvmsg_nosec and pushed it to the community. So on kernel versions >= 6.3, this tracepoint can be used to collect data.

As mentioned in the previous comment, I have implemented this in netatop-bpf.

However, for kernel versions below 6.3 we used an internal-only tracepoint within ByteDance, which is not applicable to the community. So on those kernels, hooking the kernel function sock_recvmsg_nosec through a kprobe still needs to be implemented for the community. The kprobe approach is as follows:

Please refer to per-process-bandwidth-monitoring-on-Linux-with-bpftrace.
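
A minimal sketch of such a kprobe-based collector in libbpf CO-RE style, assuming sock_recvmsg_nosec is not inlined in the running kernel build; the map layout and program name are illustrative:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u32);             /* tgid */
        __type(value, __u64);           /* received bytes */
    } recv_bytes SEC(".maps");

    /* sock_recvmsg_nosec returns the number of bytes received (or a
     * negative errno) and runs in the context of the receiving process,
     * so the current tgid is meaningful here */
    SEC("kretprobe/sock_recvmsg_nosec")
    int BPF_KRETPROBE(sock_recvmsg_nosec_exit, long ret)
    {
        __u32 tgid;
        __u64 add, *val;

        if (ret <= 0)                   /* error or nothing received */
            return 0;

        tgid = bpf_get_current_pid_tgid() >> 32;
        add = ret;

        val = bpf_map_lookup_elem(&recv_bytes, &tgid);
        if (val)
            __sync_fetch_and_add(val, add);
        else
            bpf_map_update_elem(&recv_bytes, &tgid, &add, BPF_NOEXIST);

        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";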

Atoptool commented 1 year ago

Thanks. I have planned a code review for the end of this week.

Atoptool commented 1 year ago

Building netatop-bpf (cloned from https://github.com/bytedance/netatop-bpf) does not succeed. First I get errors about clang-11 and llvm-strip-11 not existing. After removing '-11' in the Makefile, the following errors are given:

..... BINARY netatop
/usr/bin/ld: .output/server.o:(.bss+0x0): multiple definition of `skel'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:24: first defined here
/usr/bin/ld: .output/server.o:(.bss+0x8): multiple definition of `semid'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:25: first defined here
/usr/bin/ld: .output/server.o:(.bss+0xc): multiple definition of `tgid_map_fd'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:26: first defined here
/usr/bin/ld: .output/server.o:(.bss+0x10): multiple definition of `tid_map_fd'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:27: first defined here
/usr/bin/ld: .output/server.o:(.bss+0x14): multiple definition of `nr_cpus'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:28: first defined here
/usr/bin/ld: .output/server.o:(.bss+0x18): multiple definition of `client_flag'; .output/netatop.o:/home/gerlof/netatop-bpf/deal.h:5: first defined here
/usr/bin/ld: .output/deal.o:(.bss+0x0): multiple definition of `skel'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:24: first defined here
/usr/bin/ld: .output/deal.o:(.bss+0x8): multiple definition of `semid'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:25: first defined here
/usr/bin/ld: .output/deal.o:(.bss+0xc): multiple definition of `tgid_map_fd'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:26: first defined here
/usr/bin/ld: .output/deal.o:(.bss+0x10): multiple definition of `tid_map_fd'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:27: first defined here
/usr/bin/ld: .output/deal.o:(.bss+0x14): multiple definition of `nr_cpus'; .output/netatop.o:/home/gerlof/netatop-bpf/netatop.h:28: first defined here
/usr/bin/ld: .output/deal.o:(.bss+0x18): multiple definition of `client_flag'; .output/netatop.o:/home/gerlof/netatop-bpf/deal.h:5: first defined here
collect2: error: ld returned 1 exit status
make: *** [Makefile:142: netatop] Error 1

Did I miss something?

Atoptool commented 1 year ago

I had a look at the code to be integrated into atop. A few remarks:

  1. I noticed that the code to interface with netatop-bpf has completely replaced the code that interfaces with the (conventional) netatop kernel module. That would mean that users are forced to switch to the BPF implementation when a new version of atop is released, while there might be reasons to stick to the current netatop module (e.g. for systems with kernel < 6.3). I prefer that both interfaces coexist in future versions of atop.

  2. In the Makefile, a reference to -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include is missing in CFLAGS, and -lglib-2.0 is missing in the link step of atop.

  3. In photoproc.c an include is missing: #include <glib.h>

  4. Concerning source file netatopif.c:

    • Function netatop_probe() is called for each interval (from photoproc.c), and each time it calls atop_ipopen() to open/bind a new client socket just to detect whether netatop-bpf is installed. Please use the access() system call in netatop_probe() to check whether the netatop server socket path exists, as an efficient way to detect whether netatop-bpf is running at all (see the sketch after this list).

    • Five positions are reserved for the PID in the pathname of a client socket, while modern kernels allow 7-digit PIDs (pid_max can be raised to 4194304).

    • Is there any reason to bind the client socket to a path name at all?
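
A sketch of that suggested probe (the server socket path is an assumption for illustration):

    #include <unistd.h>

    #define NETATOP_SERVER_SOCKET "/run/netatop-bpf.sock"  /* assumed path */

    /* cheap per-interval probe: returns 1 when the netatop-bpf server
     * socket exists, without opening/binding a client socket each time */
    int netatop_probe(void)
    {
        return access(NETATOP_SERVER_SOCKET, F_OK) == 0;
    }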

liutingjieni commented 1 year ago

> Building netatop-bpf (cloned from https://github.com/bytedance/netatop-bpf) does not succeed. First I get errors about clang-11 and llvm-strip-11 not existing. After removing '-11' in the Makefile, [the linker errors quoted above] are given.
>
> Did I miss something?

You can pull the netatop-bpf repository again. Note that I have changed the way libbpf & bpftool are included as submodules. The error that occurred indicates that newer versions of gcc are stricter about multiple definitions of global variables (-fno-common is the default since gcc 10). You can try compiling netatop again.
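
A generic illustration of this kind of fix (not the exact netatop-bpf patch): a global variable defined in a header that is included by several .c files no longer links under -fno-common.

    /* deal.h, before: a tentative definition duplicated in every
     * translation unit that includes this header, giving
     * "multiple definition of `client_flag'" with gcc >= 10:
     *
     *     int client_flag;
     *
     * deal.h, after: only declare the variable in the header ... */
    extern int client_flag;

    /* ... and define it in exactly one .c file, e.g. netatop.c */
    int client_flag;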

Atoptool commented 1 year ago

Result after compiling again:

... BINARY netatop
/usr/bin/ld: .output/server.o:(.bss+0x0): multiple definition of `client_flag'; .output/netatop.o:/home/gerlof/netatop-bpf/deal.h:5: first defined here
/usr/bin/ld: .output/deal.o:(.bss+0x0): multiple definition of `client_flag'; .output/netatop.o:/home/gerlof/netatop-bpf/deal.h:5: first defined here
collect2: error: ld returned 1 exit status
make: *** [Makefile:142: netatop] Error 1

liutingjieni commented 1 year ago

Sorry, you can pull and try compiling netatop again.

liutingjieni commented 1 year ago

> I had a look at the code to be integrated into atop. A few remarks: [the four remarks quoted above]

I will modify my code according to your comments.

Atoptool commented 1 year ago

Compiling the netatop-bpf code succeeds now. However no network activity is shown by atop (all counters remain zero). A system call trace with strace shows that only one struct netpertask is returned on every sendto request.

liutingjieni commented 1 year ago

Is the message 'Failed to attach BPF skeleton' printed? I guess the tracepoint has not been merged into your machine's kernel yet. See the patch that merged the tracepoint into the community kernel for details.

This patch is merged into the kernel, and the specific kernel version is as follows:

git describe --contains 6e6eda44
v6.3-rc1~162^2~298

The kernel version should be >= 6.3-rc1, or try backporting the patch to an older kernel.

Atoptool commented 1 year ago

I used mainline kernel 6.3.9-1 for my tests.

liutingjieni commented 1 year ago

Sorry, I have now tried running it on kernel 6.3 and it should be OK. You need to pull and reinstall netatop again. My mistake was that I had backported an intermediate version of the patch, in which the tracepoint had a different name.

liutingjieni commented 1 year ago

> I had a look at the code to be integrated into atop. A few remarks: [the four remarks quoted above]

Following your suggestions, I have completed the modifications; the two netatop interfaces now coexist. If a BPF program is present, its data is used first.

Hoping for your reply, thanks!

Atoptool commented 1 year ago

Thanks for the new version. With the combination of the new modifications for atop and the new code of netatop-bpf, I get counters showing the network traffic of the individual processes. Although I have not finished testing yet, some observations:

  1. I noticed that the network traffic measured by the netatop-bpf module only concerns the activity of processes at system call level, i.e. the sizes of the ethernet header, IP header and TCP/UDP header are not included. Also, when TCP packets are only transmitted by a process (unidirectional), you would still expect TCP packets to be received for this connection at a lower level to acknowledge the transfers. Since this traffic is not measured, at least 10% of the load that is measured on system level (when looking at the si and so of the network interfaces) is lost when comparing with the BANDWI and BANDWO on process level. In my opinion, the total of BANDWI and BANDWO for all processes should match the measured bandwidth on interface level (si and so) as closely as possible, comparable with the netatop kernel module.

  2. In the generic output of atop, the SNET and RNET columns are not shown as soon as netatop-bpf is active (the NETATOPBPF flag is not checked in showlinux.c when formatting the columns for the generic output).

liutingjieni commented 1 year ago

  1. If the hook were placed at the IP layer like the previous kernel module, the event would be triggered in a soft interrupt rather than in the context of the process. A hash table would then be needed to maintain the relationship between the five-tuple information and the process, which would lead to some performance degradation (see the sketch after this list). With the hook at the transport layer, the data counted per process differs somewhat from what is counted at system level, but I think this difference is acceptable compared to the performance degradation.
  2. In the generic output of atop, the SNET and RNET columns are now shown as soon as netatop-bpf is active.
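
A minimal sketch of the bookkeeping an IP-layer hook would force (struct layout and map size are illustrative): the transmit path, still in process context, would have to fill a flow-to-owner table that the receive path, running in softirq context, consults for every packet.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct flow_key {                   /* connection five-tuple */
        __u32 saddr, daddr;
        __u16 sport, dport;
        __u8  proto;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LRU_HASH);
        __uint(max_entries, 65536);
        __type(key, struct flow_key);
        __type(value, __u32);           /* owning tgid */
    } flow_owner SEC(".maps");

    /* every received packet (softirq context) then needs a lookup in
     * flow_owner before its bytes can be attributed to a process */
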
Atoptool commented 1 year ago

I performed some tests with netatop-bpf to compare the results with the netatop kernel module:

Test 1: Unidirectional transfer using a buffer size of 50 bytes via TCP:

NET | transport    |  tcpi   63573 | tcpo  189043 | udpi       0  | udpo       0 | tcpao      0  |
NET | network      |  ipi    63574 | ipo   111011 | ipfrw      0  | deliv  63573 | icmpi      0  |
NET | ens192    2% |  pcki   63587 | pcko  111011 | sp   10 Gbps  | si 3357 Kbps | so  212 Mbps  |

    PID     TID  TCPRCV TCPRASZ  TCPSND TCPSASZ  BANDWI     BANDWO  NET  CMD
 542010       -       0       0  5159e3      50  0 Kbps   206 Mbps 100%  attract
 541789       -       0       0      97      52  0 Kbps     4 Kbps   0%  sshd

The traffic measured on interface level (si and so for ens192) and on process level (BANDWI and BANDWO) is rather close, because the small packets transmitted by the application are combined into larger chunks before they are transmitted to the interface. Therefore, the overhead of the protocol headers (which is not measured by netatop-bpf) is relatively small in this case.

Test 2: Bidirectional transfer (request/response) using a buffer size of 50 bytes via TCP:

NET | transport    |  tcpi   65690 | tcpo   65695 | udpi       0  | udpo       0 | tcpao      2  |
NET | network      |  ipi    65694 | ipo    65694 | ipfrw      0  | deliv  65690 | icmpi      0  |
NET | ens192    0% |  pcki   65711 | pcko   65691 | sp   10 Gbps  | si 6097 Kbps | so 6098 Kbps  |

    PID     TID  TCPRCV TCPRASZ  TCPSND TCPSASZ      BANDWI     BANDWO  NET  CMD
 543076       -   65656      50   65656      50   2626 Kbps  2626 Kbps 100%  attract
 541789       -       0       0      91      55      0 Kbps     4 Kbps   0%  sshd

With bidirectional transfer the application messages cannot be combined any more. When comparing the traffic measured on interface level (si and so for ens192, about 12.2 Mbps in total) with process level (BANDWI and BANDWO, about 5.3 Mbps in total), 57% of the network traffic could not be accounted to processes, while the same process (attract) is also responsible for that traffic load!

Test 3: Reading a huge file via NFS (measured on the NFS client) via TCP:

NET | transport    |  tcpi   45325 | tcpo   36989 | udpi       0  | udpo       0 | tcpao      0  |
NET | network      |  ipi    45329 | ipo    36995 | ipfrw      0  | deliv  45325 | icmpi      0  |
NET | ens192   18% |  pcki   45335 | pcko   36996 | sp   10 Gbps  | si 1854 Mbps | so 4671 Kbps  |

    PID     TID  TCPRCV TCPRASZ  TCPSND TCPSASZ     BANDWI     BANDWO  NET  CMD
 542102       -       0       0    4498     192     0 Kbps   690 Kbps  25%  kworker/u4:15-
 542085       -       0       0    4076     192     0 Kbps   626 Kbps  23%  kworker/u4:0-w
 542099       -       0       0    3412     192     0 Kbps   524 Kbps  19%  kworker/u4:14-
 542098       -       0       0    2854     192     0 Kbps   438 Kbps  16%  kworker/u4:13-
 542087       -       0       0    2812     192     0 Kbps   431 Kbps  16%  kworker/u4:5-x
 541789       -       0       0     139      52     0 Kbps     5 Kbps   0%  sshd

On interface level 1854 Mbps is measured while on process level no incoming traffic is measured at all, probably because the TCP transfers are done by the kworker kernel processes.

Based on the results of test 2 and 3, I do not agree that the difference between system level and process level metrics is acceptable compared to performance degradation. From your measurements I understand that netatop (v1) gives about 6% more overhead compared to netatop-bpf (v2), but netatop is able to account all traffic to processes.

Suggestion: would it be possible to configure a future version of netatop-bpf to use a lower-level tracepoint to obtain more accurate metrics at the cost of more overhead (besides the current implementation with a higher-level tracepoint)?

liutingjieni commented 1 year ago

I found a bug in counting the BANDWI traffic, so I fixed it. Here are the results I recorded based on your experiments; I hope to get your review again.

Test 1: Unidirectional transfer using a buffer size of 54 bytes via TCP: (screenshot)

Test 2: Bidirectional transfer (request/response) using a buffer size of 54 bytes via TCP: (screenshot)

Test 3: Reading a huge file via NFS (scp) via TCP: (screenshot)

Atoptool commented 1 year ago

FYI: my original test 1 was just there to emphasise that the unidirectional traffic was okay, because the protocol overhead (which is not accounted) is relatively small when you transmit big data blocks.

I repeated test 2 (bidirectional transfer) which gives me the same results as before: about 50 to 60% of the traffic on system level (si/so) is not accounted to the processes (BANDWI/BANDWO). In my opinion this is still caused by measuring the traffic at a level which is too high in the protocol stack.

I also repeated test 3 on an NFS client reading a huge file with the cp command (beware, not with scp): now I get the proper BANDWI metrics for the kworker processes, corresponding to the si on system level, so that looks good!

liutingjieni commented 1 year ago

Would you consider forking netatop-bpf into the Atoptool organization?

Atoptool commented 1 year ago

I prefer keeping additions separated from atoptool, similar to the netatop module itself. However, I can make a link to netatop-bpf (also from atoptool.nl) and mention it in the man page.

liutingjieni commented 1 year ago

> I prefer keeping additions separated from atoptool, similar to the netatop module itself. However, I can make a link to netatop-bpf (also from atoptool.nl) and mention it in the man page.

I support you!