eunomia-bpf / bpftime

Userspace eBPF runtime for Observability, Network & General Extensions Framework
https://eunomia.dev/bpftime/
MIT License

Add alive check for syscall server #287

Closed Officeyutong closed 4 months ago

Officeyutong commented 5 months ago

Closes #178

Adds a watchdog for agents. If the syscall server dies, agents will exit automatically.

On startup, both the agent and the server spawn a separate watchdog thread. The server-side thread updates a timestamp stored in shared memory every 50ms, indicating that the server was still alive at that time. The agent-side thread keeps reading this timestamp from shared memory and checks whether it has gone more than 150ms without an update. If so, it regards the server as dead and starts to detach.
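
The heartbeat scheme described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `heartbeat_slot`, `server_beat`, `server_seems_dead` and the two loop functions are hypothetical names, and the slot is an ordinary in-process object rather than something placed in bpftime's shared memory. The 50ms and 150ms values are the ones stated in the description.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical heartbeat slot. In the real design this timestamp would live
// in the shared-memory segment; here it is a plain object for illustration.
struct heartbeat_slot {
    std::atomic<std::int64_t> last_beat_ms{0};
};

// Milliseconds from a monotonic clock (assumption: both sides read the same
// clock source, which a cross-process setup would have to guarantee).
inline std::int64_t now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
               steady_clock::now().time_since_epoch()).count();
}

// Server side: record "still alive at time `now`".
inline void server_beat(heartbeat_slot &slot, std::int64_t now) {
    slot.last_beat_ms.store(now, std::memory_order_release);
}

// Agent side: true if the heartbeat is older than `timeout_ms`.
inline bool server_seems_dead(const heartbeat_slot &slot, std::int64_t now,
                              std::int64_t timeout_ms = 150) {
    assert(timeout_ms > 0);
    return now - slot.last_beat_ms.load(std::memory_order_acquire) > timeout_ms;
}

// Server watchdog thread body: refresh the timestamp every 50ms until stopped.
inline void server_watchdog_loop(heartbeat_slot &slot,
                                 const std::atomic<bool> &stop) {
    while (!stop.load()) {
        server_beat(slot, now_ms());
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}

// Agent watchdog thread body: poll until the heartbeat goes stale, then run
// the caller-supplied detach callback.
inline void agent_watchdog_loop(const heartbeat_slot &slot, void (*on_dead)()) {
    while (!server_seems_dead(slot, now_ms())) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    on_dead();
}
```

With a 50ms update interval and a 150ms timeout, the agent tolerates two missed updates before it treats the server as dead. Note that in this sketch an agent that starts polling before the first heartbeat is written sees a stale slot immediately; a real implementation would need to decide how to handle that case.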

Demo

Term1

root@mnfe-pve:~/bpftime/example/malloc# bpftime load ./malloc
[2024-05-01 18:53:44.152] [info] [syscall_context.hpp:86] manager constructed
libbpf: loading object 'malloc_bpf' from buffer
libbpf: elf: section(2) .symtab, size 192, link 1, flags 0, type=2
libbpf: elf: section(3) uprobe/libc.so.6:malloc, size 440, link 0, flags 6, type=1
libbpf: sec 'uprobe/libc.so.6:malloc': found program 'do_count' at insn offset 0 (0 bytes), code size 55 insns (440 bytes)
libbpf: elf: section(4) .rodata.str1.1, size 27, link 0, flags 32, type=1
libbpf: elf: section(5) .maps, size 32, link 0, flags 3, type=1
libbpf: elf: section(6) license, size 4, link 0, flags 3, type=1
libbpf: license of malloc_bpf is GPL
libbpf: elf: section(7) .reluprobe/libc.so.6:malloc, size 64, link 2, flags 40, type=9
libbpf: elf: section(8) .BTF, size 1434, link 0, flags 0, type=1
libbpf: elf: section(9) .BTF.ext, size 384, link 0, flags 0, type=1
libbpf: looking for externs among 8 symbols...
libbpf: collected 0 externs total
libbpf: map 'libc_malloc_calls_total': at sec_idx 5, offset 0.
libbpf: map 'libc_malloc_calls_total': found type = 1.
libbpf: map 'libc_malloc_calls_total': found key [8], sz = 4.
libbpf: map 'libc_malloc_calls_total': found value [12], sz = 8.
libbpf: map 'libc_malloc_calls_total': found max_entries = 1024.
libbpf: map '.rodata.str1.1' (global data): at sec_idx 4, offset 0, flags 80.
[2024-05-01 18:53:44.156] [info] [syscall_server_utils.cpp:24] Initialize syscall server
[2024-05-01 18:53:44][error][1637850] pkey_alloc failed
[2024-05-01 18:53:44][info][1637850] Global shm constructed. shm_open_type 0 for bpftime_maps_shm
[2024-05-01 18:53:44][info][1637850] Global shm initialized
[2024-05-01 18:53:44][info][1637850] Enabling helper groups ufunc, kernel, shm_map by default
[2024-05-01 18:53:44][info][1637850] bpftime-syscall-server started
[2024-05-01 18:53:44][info][1637851] Server side watchdog started
libbpf: map 1 is ".rodata.str1.1"
libbpf: sec '.reluprobe/libc.so.6:malloc': collecting relocation for section(3) 'uprobe/libc.so.6:malloc'
libbpf: sec '.reluprobe/libc.so.6:malloc': relo #0: insn #24 against 'libc_malloc_calls_total'
libbpf: prog 'do_count': found map 0 (libc_malloc_calls_total, sec 5, off 0) for insn #24
libbpf: sec '.reluprobe/libc.so.6:malloc': relo #1: insn #32 against 'libc_malloc_calls_total'
libbpf: prog 'do_count': found map 0 (libc_malloc_calls_total, sec 5, off 0) for insn #32
libbpf: sec '.reluprobe/libc.so.6:malloc': relo #2: insn #37 against 'libc_malloc_calls_total'
libbpf: prog 'do_count': found map 0 (libc_malloc_calls_total, sec 5, off 0) for insn #37
libbpf: sec '.reluprobe/libc.so.6:malloc': relo #3: insn #49 against 'libc_malloc_calls_total'
libbpf: prog 'do_count': found map 0 (libc_malloc_calls_total, sec 5, off 0) for insn #49
libbpf: map 'libc_malloc_calls_total': created successfully, fd=4
libbpf: map '.rodata.str1.1': created successfully, fd=5
libbpf: resolved 'libc.so.6' to '/lib/x86_64-linux-gnu/libc.so.6'
libbpf: elf: symbol address match for 'malloc' in '/lib/x86_64-linux-gnu/libc.so.6': 0x98860
[2024-05-01 18:53:44][info][1637850] Created uprobe/uretprobe perf event handler, module name /lib/x86_64-linux-gnu/libc.so.6, offset 98860
18:53:45 
18:53:46 

Term2

root@mnfe-pve:~/bpftime/example/malloc# bpftime start ./victim
[2024-05-01 18:54:37.575] [info] [agent.cpp:75] Entering bpftime agent
[2024-05-01 18:54:37.576] [error] [bpftime_shm_internal.cpp:669] pkey_alloc failed
[2024-05-01 18:54:37.576] [info] [bpftime_shm_internal.cpp:687] Global shm constructed. shm_open_type 1 for bpftime_maps_shm
[2024-05-01 18:54:37.576] [info] [bpftime_shm_internal.cpp:38] Global shm initialized
[2024-05-01 18:54:37.577] [info] [bpftime_shm_internal.cpp:833] Agent side watchdog started
[2024-05-01 18:54:37.578] [info] [bpf_attach_ctx.cpp:171] Register attach-impl defined helper bpf_get_func_arg, index 183
[2024-05-01 18:54:37.578] [info] [bpf_attach_ctx.cpp:171] Register attach-impl defined helper bpf_get_func_ret_id, index 184
[2024-05-01 18:54:37.578] [info] [bpf_attach_ctx.cpp:171] Register attach-impl defined helper bpf_get_retval, index 186
[2024-05-01 18:54:37.578] [info] [agent.cpp:162] Initializing agent..
[2024-05-01 18:54:37][info][1638300] Initializing llvm
[2024-05-01 18:54:37][warning][1638300] Not implemented yet: toggle_bounds_check
[2024-05-01 18:54:37][info][1638300] Executable path: /root/bpftime/example/malloc/victim
malloc called from pid 1638300
malloc called from pid 1638300
[2024-05-01 18:54:37][info][1638300] Attach successfully
malloc called from pid 1638300
continue malloc...
malloc called from pid 1638300

Then stop the syscall server (Ctrl+C) and look at the agent's output:

continue malloc...
malloc called from pid 1638300
[2024-05-01 18:54:39][error][1638300] Expected fd 4 to be a map fd (map_ptr_by_fd)
[2024-05-01 18:54:39][error][1638300] Expected fd 4 to be a map fd (map_ptr_by_fd)
[2024-05-01 18:54:39][error][1638300] Expected fd 4 to be a map fd (map_ptr_by_fd)
continue malloc...
[2024-05-01 18:54:39][warning][1638301] Syscall server seems to be dead, agent will exit now
malloc called from pid 1638300
[2024-05-01 18:54:39][error][1638301] Expected fd 4 to be a map fd (map_ptr_by_fd)
[2024-05-01 18:54:39][error][1638301] Expected fd 4 to be a map fd (map_ptr_by_fd)
[2024-05-01 18:54:39][error][1638301] Expected fd 4 to be a map fd (map_ptr_by_fd)
[2024-05-01 18:54:39][info][1638301] Agent side watchdog exited
continue malloc...
continue malloc...
continue malloc

yunwei37 commented 5 months ago

I think we might need some further discussion on this design.

For example, what happens if the agent or server is running a fork()?

Officeyutong commented 5 months ago

> I think we might need some further discussion on this design.
>
> For example, what happens if the agent or server is running a fork()?

The new process will only contain the thread that executed fork(), so the watchdog thread would not be running in the child.
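
The fork() behaviour mentioned here can be observed with a small experiment. The sketch below is not taken from bpftime; `ticker_survives_fork` is a hypothetical helper that starts a background thread bumping a counter (standing in for a watchdog thread), forks, and lets the child check whether the counter still moves. It assumes a POSIX system.

```cpp
#include <atomic>
#include <chrono>
#include <stdexcept>
#include <thread>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

// Returns true if a counter driven by a background thread keeps advancing
// inside a forked child, false if it stays frozen there.
inline bool ticker_survives_fork() {
    static std::atomic<long> ticks{0};
    std::atomic<bool> stop{false};

    // Background thread standing in for a watchdog thread.
    std::thread ticker([&stop] {
        while (!stop.load()) {
            ticks.fetch_add(1);
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });
    // Make sure the thread is really running before forking.
    while (ticks.load() == 0) std::this_thread::yield();

    pid_t pid = fork();
    if (pid == 0) {
        // Child: only the thread that called fork() exists here.
        long before = ticks.load();
        timespec ts{0, 100 * 1000 * 1000};  // 100ms
        nanosleep(&ts, nullptr);
        long after = ticks.load();
        _exit(after != before ? 1 : 0);     // 1 = counter moved, 0 = frozen
    }

    int status = 0;
    if (pid > 0) waitpid(pid, &status, 0);
    stop.store(true);
    ticker.join();
    if (pid < 0) throw std::runtime_error("fork failed");
    return WIFEXITED(status) && WEXITSTATUS(status) == 1;
}
```

On a POSIX system the child should see a frozen counter, so the function returns false. A forked agent or server would therefore have no running watchdog thread unless it starts a new one.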

Officeyutong commented 5 months ago

> I think we might need some further discussion on this design.
>
> For example, what happens if the agent or server is running a fork()?

We may need to discuss how we should handle forked processes.

yunwei37 commented 1 month ago

Can we reopen this?