Closed dukeartem closed 2 months ago
I cannot exactly reproduce the case that is presented, but I think this happens because the mesh state that it removes is still referencing a query that has already been deleted. To stop that, I added a call to remove the mesh state once the reply has been sent. That should then stop the code from calling the mesh_state_remove_reply routine, because the reply pointer has been set to NULL because the query has stopped. And then it should no longer crash.
Another commit that fixes the same issue, in case the http2 reply is dropped. Then it also cleans up the mesh state reference. So that in a postponed closed channel, the delete routine then does not call mesh_state_remove_reply, and thus also does not crash there.
Unfortunately, after patched and one day work, i have caught new trace
==826107==ERROR: AddressSanitizer: heap-use-after-free on address 0x629001ced4c0 at pc 0x000001532e27 bp 0x7ffbb47b0810 sp 0x7ffbb47b07f0
READ of size 8 at 0x629001ced4c0 thread T14
#0 0x1532e26 (/unbound-1.20-sanitizer+0x1532e26) mesh_state_remove_reply unbound/services/mesh.c:2094
#1 0x164781f (/unbound-1.20-sanitizer+0x164781f) http2_stream_delete unbound/util/netevent.c:3293
#2 0x164752d (/unbound-1.20-sanitizer+0x164752d) http2_stream_close_cb unbound/util/netevent.c:3370
#3 0x14c4db3 (/unbound-1.20-sanitizer+0x14c4db3) nghttp2_session_close_stream contrib/libs/nghttp2/lib/nghttp2_session.c:1496
#4 0x14cbc6f (/unbound-1.20-sanitizer+0x14cbc6f) nghttp2_session_on_rst_stream_received contrib/libs/nghttp2/lib/nghttp2_session.c:4554
#5 0x14dcd90 (/unbound-1.20-sanitizer+0x14dcd90) session_process_rst_stream_frame contrib/libs/nghttp2/lib/nghttp2_session.c:4569
#6 0x14d6f1a (/unbound-1.20-sanitizer+0x14d6f1a) nghttp2_session_mem_recv2 contrib/libs/nghttp2/lib/nghttp2_session.c:6548
#7 0x14e03c7 (/unbound-1.20-sanitizer+0x14e03c7) nghttp2_session_recv contrib/libs/nghttp2/lib/nghttp2_session.c:7366
#8 0x165d1d7 (/unbound-1.20-sanitizer+0x165d1d7) comm_point_http2_handle_read unbound/util/netevent.c:3459
#9 0x164940c (/unbound-1.20-sanitizer+0x164940c) comm_point_http_handle_read unbound/util/netevent.c:3510
#10 0x164056b (/unbound-1.20-sanitizer+0x164056b) comm_point_http_handle_callback unbound/util/netevent.c:3897
#11 0x9dc58f (/unbound-1.20-sanitizer+0x9dc58f) event_persist_closure contrib/libs/libevent/event.c:1623
#12 0x9d9e1a (/unbound-1.20-sanitizer+0x9d9e1a) event_process_active_single_queue contrib/libs/libevent/event.c:1682
#13 0x9ca2b2 (/unbound-1.20-sanitizer+0x9ca2b2) event_process_active contrib/libs/libevent/event.c:1783
#14 0x9c6d9e (/unbound-1.20-sanitizer+0x9c6d9e) event_base_loop contrib/libs/libevent/event.c:2006
#15 0x9c6176 (/unbound-1.20-sanitizer+0x9c6176) event_base_dispatch contrib/libs/libevent/event.c:1817
#16 0x811094 (/unbound-1.20-sanitizer+0x811094) ub_event_base_dispatch unbound/util/ub_event.c:280
#17 0x163655e (/unbound-1.20-sanitizer+0x163655e) comm_base_dispatch unbound/util/netevent.c:282
#18 0x8091dd (/unbound-1.20-sanitizer+0x8091dd) worker_work unbound/daemon/worker.c:2357
#19 0x7c079e (/unbound-1.20-sanitizer+0x7c079e) thread_start unbound/daemon/daemon.c:638
#20 0x90b833 (/unbound-1.20-sanitizer+0x90b833) _ZN6__asan10AsanThread11ThreadStartEy /-S/contrib/libs/clang16-rt/lib/asan/asan_thread.cpp:277
#21 0x8ca076 (/unbound-1.20-sanitizer+0x8ca076) _ZL17asan_thread_startPv /-S/contrib/libs/clang16-rt/lib/asan/asan_interceptors.cpp:199
#22 0x7ffbc2dc6b42 (/lib/x86_64-linux-gnu/libc.so.6+0x94b42) (BuildId: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
#23 0x7ffbc2e589ff (/lib/x86_64-linux-gnu/libc.so.6+0x1269ff) (BuildId: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
0x629001ced4c0 is located 704 bytes inside of 16384-byte region [0x629001ced200,0x629001cf1200)
freed by thread T14 here:
#0 0x8fc090 (/unbound-1.20-sanitizer+0x8fc090) free /-S/contrib/libs/clang16-rt/lib/asan/asan_malloc_linux.cpp:52
#1 0x166ce9d (/unbound-1.20-sanitizer+0x166ce9d) regional_destroy unbound/util/regional.c:141
#2 0x15c51b0 (/unbound-1.20-sanitizer+0x15c51b0) alloc_reg_release unbound/util/alloc.c:345
#3 0x15252c4 (/unbound-1.20-sanitizer+0x15252c4) mesh_state_cleanup unbound/services/mesh.c:993
#4 0x151cbc8 (/unbound-1.20-sanitizer+0x151cbc8) mesh_state_delete unbound/services/mesh.c:1031
#5 0x1531ca7 (/unbound-1.20-sanitizer+0x1531ca7) mesh_continue unbound/services/mesh.c:1929
#6 0x1522867 (/unbound-1.20-sanitizer+0x1522867) mesh_run unbound/services/mesh.c:1960
#7 0x15247f8 (/unbound-1.20-sanitizer+0x15247f8) mesh_report_reply unbound/services/mesh.c:856
#8 0x7f0ca7 (/unbound-1.20-sanitizer+0x7f0ca7) worker_handle_service_reply unbound/daemon/worker.c:269
#9 0x156249a (/unbound-1.20-sanitizer+0x156249a) serviced_callbacks unbound/services/outside_network.c:3051
#10 0x1567208 (/unbound-1.20-sanitizer+0x1567208) serviced_udp_callback unbound/services/outside_network.c:3392
#11 0x1554f72 (/unbound-1.20-sanitizer+0x1554f72) outnet_udp_cb unbound/services/outside_network.c:1537
#12 0x163d965 (/unbound-1.20-sanitizer+0x163d965) comm_point_udp_callback unbound/util/netevent.c:1145
#13 0x9dc58f (/unbound-1.20-sanitizer+0x9dc58f) event_persist_closure contrib/libs/libevent/event.c:1623
#14 0x9d9e1a (/unbound-1.20-sanitizer+0x9d9e1a) event_process_active_single_queue contrib/libs/libevent/event.c:1682
#15 0x9ca2b2 (/unbound-1.20-sanitizer+0x9ca2b2) event_process_active contrib/libs/libevent/event.c:1783
#16 0x9c6d9e (/unbound-1.20-sanitizer+0x9c6d9e) event_base_loop contrib/libs/libevent/event.c:2006
#17 0x9c6176 (/unbound-1.20-sanitizer+0x9c6176) event_base_dispatch contrib/libs/libevent/event.c:1817
#18 0x811094 (/unbound-1.20-sanitizer+0x811094) ub_event_base_dispatch unbound/util/ub_event.c:280
#19 0x163655e (/unbound-1.20-sanitizer+0x163655e) comm_base_dispatch unbound/util/netevent.c:282
#20 0x8091dd (/unbound-1.20-sanitizer+0x8091dd) worker_work unbound/daemon/worker.c:2357
#21 0x7c079e (/unbound-1.20-sanitizer+0x7c079e) thread_start unbound/daemon/daemon.c:638
#22 0x90b833 (/unbound-1.20-sanitizer+0x90b833) _ZN6__asan10AsanThread11ThreadStartEy /-S/contrib/libs/clang16-rt/lib/asan/asan_thread.cpp:277
previously allocated by thread T0 here:
#0 0x8fc3c3 (/unbound-1.20-sanitizer+0x8fc3c3) malloc /-S/contrib/libs/clang16-rt/lib/asan/asan_malloc_linux.cpp:69
#1 0x166cb1c (/unbound-1.20-sanitizer+0x166cb1c) regional_create_custom_large_object unbound/util/regional.c:94
#2 0x166cadc (/unbound-1.20-sanitizer+0x166cadc) regional_create_custom unbound/util/regional.c:108
#3 0x15c331f (/unbound-1.20-sanitizer+0x15c331f) prealloc_blocks unbound/util/alloc.c:91
#4 0x15c3258 (/unbound-1.20-sanitizer+0x15c3258) alloc_init unbound/util/alloc.c:122
#5 0x7bd153 (/unbound-1.20-sanitizer+0x7bd153) daemon_create_workers unbound/daemon/daemon.c:581
#6 0x7bbb92 (/unbound-1.20-sanitizer+0x7bbb92) daemon_fork unbound/daemon/daemon.c:798
#7 0x7eb5b9 (/unbound-1.20-sanitizer+0x7eb5b9) run_daemon unbound/daemon/unbound.c:738
#8 0x7eaa26 (/unbound-1.20-sanitizer+0x7eaa26) main unbound/daemon/unbound.c:851
#9 0x7ffbc2d5bd8f (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
Thread T14 created by T0 here:
#0 0x8c9f44 (/unbound-1.20-sanitizer+0x8c9f44) pthread_create /-S/contrib/libs/clang16-rt/lib/asan/asan_interceptors.cpp:208
#1 0x7bd81c (/unbound-1.20-sanitizer+0x7bd81c) daemon_start_others unbound/daemon/daemon.c:654
#2 0x7bbb9e (/unbound-1.20-sanitizer+0x7bbb9e) daemon_fork unbound/daemon/daemon.c:809
#3 0x7eb5b9 (/unbound-1.20-sanitizer+0x7eb5b9) run_daemon unbound/daemon/unbound.c:738
#4 0x7eaa26 (/unbound-1.20-sanitizer+0x7eaa26) main unbound/daemon/unbound.c:851
#5 0x7ffbc2d5bd8f (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
SUMMARY: AddressSanitizer: heap-use-after-free (/unbound-1.20-sanitizer+0x1532e26) (BuildId: 536b8443a1d84c8ba2dfd3bcef518ba8fce2555b)
Shadow bytes around the buggy address:
0x629001ced200: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x629001ced280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x629001ced300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x629001ced380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x629001ced400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x629001ced480: fd fd fd fd fd fd fd fd[fd]fd fd fd fd fd fd fd
0x629001ced500: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x629001ced580: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x629001ced600: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x629001ced680: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x629001ced700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==826107==ABORTING
I get the impression from the trace that not both patches are included, and specifically not the second one, https://github.com/NLnetLabs/unbound/commit/d52f501d903909096b9d971cbda9e5b65eba6777 is included in the patched unbound. Did the patches, both of them, get used by the patched unbound?
i re-checked
and
readelf -Ws ./unbound-1.20-sanitizer | grep http2_stream_remove_mesh_state
1997: 0000000001647300 83 FUNC GLOBAL DEFAULT 16 http2_stream_remove_mesh_state
65876: 0000000001647300 83 FUNC GLOBAL DEFAULT 16 http2_stream_remove_mesh_state
maybe i can gather another debug info? Or test some ugly hack, just for catch the problem.
The commit https://github.com/NLnetLabs/unbound/commit/8947c2c7646c2f8646b3e10efe25552f5e789068 adds some more fixes. I found several code paths where a connection is dropped, but then the mesh state reference would still be wrong. The commit passes unit tests. Thank you for testing this!
nope,
but only one for 12 hours, maybe exist another code paths where a connection is dropped
The commit https://github.com/NLnetLabs/unbound/commit/8fca3e7c5b0acdee9a2b687cddc41e903c0788d5 is another attempt. It looks like this fixes more. It now removes, for dropped connections, the mesh state for the stream that is associated with the reply, and not the currently, most recent http2 stream, what it did before. I guess in some case there was use of multiple http2 streams, and one was closed, and then it removed the mesh state from the wrong entry, causing a later crash when the first stream was then closed. In addition, it does not remove the mesh state twice in two code paths, and initializes the http2 stream variable more, in an attempt at more correct administration of the mesh state and http2 stream.
The commit therefore fixes I think some additional cases, where multiple http2 streams in one http2 connection were in use, and one of the streams was closed. The fix passes unit tests, that test other things.
It's working for 24 hours without any fault. Many thanks for all this tried and as result to solve the problem. Thank you!
Hello, After upgrade unbound from 1.19.3 to 1.20.0 we have SIGSEGV in randomly moment, but all gathered traces about one place. This line has struct
cp
and it's ok, butn->query_reply.c
has invalid memory address. Details: (disclaimer: all lines near function don't equal upstream code, because we have a few internal patch) gdb:I rebuilt unbound with address sanitizer and gathered trace:
For clarity, that line
0 0x1532b06 (/unbound-1.20-sanitizer+0x1532b06) mesh_state_remove_reply unbound/services/mesh.c:2087
it's this place in upstreamI think it's some sort of race in libevent model and unbound with nghttp2. But i couldn't reproduce on test stand, only on user activity(after 1k rps) and this happens rarely at random moments, maybe one time per day.
System:
unbound -V
output: