dankamongmen closed this 4 years ago
#0 0x00007ffff77b5e00 in __memcmp_avx2_movbe () from /usr/lib/libc.so.6
#1 0x000055555555feca in lookup_global_l3host (fam=fam@entry=2,
addr=addr@entry=0x7fffd0004d50) at src/omphalos/netaddrs.c:399
#2 0x0000555555568114 in offer_wresolution (fam=fam@entry=2,
addr=addr@entry=0x7fffd0004d50,
name=name@entry=0x7fffd000bea0 L"talk.l.google.com",
nlevel=nlevel@entry=NAMING_LEVEL_DNS, nsfam=nsfam@entry=2,
nameserver=nameserver@entry=0x7fffdd439cb0) at src/omphalos/resolv.c:159
#3 0x00005555555681d0 in offer_resolution (fam=fam@entry=2,
addr=addr@entry=0x7fffd0004d50, name=<optimized out>,
name@entry=0x7fffd00046e0 "talk.l.google.com",
nlevel=nlevel@entry=NAMING_LEVEL_DNS, nsfam=nsfam@entry=2,
nameserver=nameserver@entry=0x7fffdd439cb0) at src/omphalos/resolv.c:143
#4 0x000055555556aeb2 in handle_dns_packet (op=0x7fffdd439d90,
frame=0x7fffddfe306c, len=21) at src/omphalos/dns.c:542
#5 0x0000555555568cfb in handle_udp_packet (op=0x7fffdd439d90,
frame=<optimized out>, len=<optimized out>) at src/omphalos/udp.c:44
#6 0x000055555556f232 in handle_ring_packet (
iface=0x555555717758 <interfaces+14232>, fd=<optimized out>,
frame=0x7fffddfe3042, frame@entry=0x7fffddfe3000)
at src/omphalos/psocket.c:254
#7 0x0000555555563849 in ring_packet_loop (pm=0x555569655b20,
pm=0x555569655b20) at src/omphalos/netlink.c:433
#8 psocket_thread (unsafe=0x555569655b20) at src/omphalos/netlink.c:460
#9 psocket_thread (unsafe=0x555569655b20) at src/omphalos/netlink.c:448
#10 0x00007ffff78284cf in start_thread () from /usr/lib/libpthread.so.0
#11 0x00007ffff77572d3 in clone () from /usr/lib/libc.so.6
Ahhhh yes there's a note from early-middle-aged me about it in lookup_global_l3host()'s body, and this chestnut:
// Browse the global list. Don't create the host if it doesn't exist. Since
// references are handed out without a lock held, we cannot destroy an l3host
// which is on the global list! This is fundamentally unsafe, really FIXME.
struct l3host *lookup_global_l3host(int fam,const void *addr){
i suck!
This is on my laptop while at home, btw. It's not a heavy-traffic situation at all.
I very rarely (if ever?) see this on my Debian workstation, btw. Odd.
still all over the place. i have a hard time recommending use of this tool while this is going down. :/
==253393== Memcheck, a memory error detector
==253393== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==253393== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==253393== Command: out/omphalos/omphalos-ncurses -u --usbids=usb.ids --ouis=ieee-oui.txt
==253393==
==253393== Warning: set address range perms: large range [0x146000, 0x12b8c000) (defined)
Reloaded 2994 vendors and 16449 USB devices from usb.ids in 0.0166006s
Reloaded 24910 OUIs from ieee-oui.txt in 0.325789s
Reloaded 2 resolvers from /etc/resolv.conf in 0.0002607s
Entering ncurses mode...
==253393== Thread 4:
==253393== Invalid read of size 1
==253393== at 0x133CAEB9: bcmp (vg_replace_strmem.c:1113)
==253393== by 0x113F09: lookup_global_l3host (netaddrs.c:399)
==253393== by 0x11C133: offer_wresolution (resolv.c:159)
==253393== by 0x11C1EF: offer_resolution (resolv.c:143)
==253393== by 0x11EF21: handle_dns_packet (dns.c:542)
==253393== by 0x11CD6A: handle_udp_packet (udp.c:44)
==253393== by 0x123311: handle_ring_packet (psocket.c:254)
==253393== by 0x117888: ring_packet_loop (netlink.c:434)
==253393== by 0x117888: psocket_thread (netlink.c:461)
==253393== by 0x117888: psocket_thread (netlink.c:449)
==253393== by 0x1358B4CE: start_thread (in /usr/lib/libpthread-2.30.so)
==253393== by 0x136A32D2: clone (in /usr/lib/libc-2.30.so)
==253393== Address 0x1be2910c is 12 bytes inside a block of size 152 free'd
==253393== at 0x133C59AB: free (vg_replace_malloc.c:540)
==253393== by 0x114369: cleanup_l3hosts (netaddrs.c:521)
==253393== by 0x12031E: free_iface (interface.c:226)
==253393== by 0x11803E: handle_rtm_dellink (netlink.c:407)
==253393== by 0x11803E: handle_netlink_event (netlink.c:1067)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393== by 0x10F566: main (ncurses.c:677)
==253393== Block was alloc'd at
==253393== at 0x133C477F: malloc (vg_replace_malloc.c:309)
==253393== by 0x113938: MMalloc (util.h:20)
==253393== by 0x113938: create_l3host (netaddrs.c:109)
==253393== by 0x113938: lookup_l3host_common (netaddrs.c:327)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1131C2: tx_mdns_ptr (mdns.c:106)
==253393== by 0x11BF3C: queue_for_naming (resolv.c:94)
==253393== by 0x113AD9: lookup_l3host_common (netaddrs.c:364)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1207C7: add_route4 (interface.c:309)
==253393== by 0x124EC6: handle_rtm_newroute (route.c:257)
==253393== by 0x1181F1: handle_netlink_event (netlink.c:1073)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393==
==253393== Invalid read of size 8
==253393== at 0x113EF0: lookup_global_l3host (netaddrs.c:398)
==253393== by 0x11C133: offer_wresolution (resolv.c:159)
==253393== by 0x11C1EF: offer_resolution (resolv.c:143)
==253393== by 0x11EF21: handle_dns_packet (dns.c:542)
==253393== by 0x11CD6A: handle_udp_packet (udp.c:44)
==253393== by 0x123311: handle_ring_packet (psocket.c:254)
==253393== by 0x117888: ring_packet_loop (netlink.c:434)
==253393== by 0x117888: psocket_thread (netlink.c:461)
==253393== by 0x117888: psocket_thread (netlink.c:449)
==253393== by 0x1358B4CE: start_thread (in /usr/lib/libpthread-2.30.so)
==253393== by 0x136A32D2: clone (in /usr/lib/libc-2.30.so)
==253393== Address 0x1be29168 is 104 bytes inside a block of size 152 free'd
==253393== at 0x133C59AB: free (vg_replace_malloc.c:540)
==253393== by 0x114369: cleanup_l3hosts (netaddrs.c:521)
==253393== by 0x12031E: free_iface (interface.c:226)
==253393== by 0x11803E: handle_rtm_dellink (netlink.c:407)
==253393== by 0x11803E: handle_netlink_event (netlink.c:1067)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393== by 0x10F566: main (ncurses.c:677)
==253393== Block was alloc'd at
==253393== at 0x133C477F: malloc (vg_replace_malloc.c:309)
==253393== by 0x113938: MMalloc (util.h:20)
==253393== by 0x113938: create_l3host (netaddrs.c:109)
==253393== by 0x113938: lookup_l3host_common (netaddrs.c:327)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1131C2: tx_mdns_ptr (mdns.c:106)
==253393== by 0x11BF3C: queue_for_naming (resolv.c:94)
==253393== by 0x113AD9: lookup_l3host_common (netaddrs.c:364)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1207C7: add_route4 (interface.c:309)
==253393== by 0x124EC6: handle_rtm_newroute (route.c:257)
==253393== by 0x1181F1: handle_netlink_event (netlink.c:1073)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393==
==253393== Invalid read of size 8
==253393== at 0x113EF0: lookup_global_l3host (netaddrs.c:398)
==253393== by 0x11C133: offer_wresolution (resolv.c:159)
==253393== by 0x11C1EF: offer_resolution (resolv.c:143)
==253393== by 0x11EEBD: handle_dns_packet (dns.c:526)
==253393== by 0x1130E8: handle_mdns_packet (mdns.c:81)
==253393== by 0x123311: handle_ring_packet (psocket.c:254)
==253393== by 0x117888: ring_packet_loop (netlink.c:434)
==253393== by 0x117888: psocket_thread (netlink.c:461)
==253393== by 0x117888: psocket_thread (netlink.c:449)
==253393== by 0x1358B4CE: start_thread (in /usr/lib/libpthread-2.30.so)
==253393== by 0x136A32D2: clone (in /usr/lib/libc-2.30.so)
==253393== Address 0x1be24258 is 104 bytes inside a block of size 152 free'd
==253393== at 0x133C59AB: free (vg_replace_malloc.c:540)
==253393== by 0x114369: cleanup_l3hosts (netaddrs.c:521)
==253393== by 0x120312: free_iface (interface.c:225)
==253393== by 0x11803E: handle_rtm_dellink (netlink.c:407)
==253393== by 0x11803E: handle_netlink_event (netlink.c:1067)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393== by 0x10F566: main (ncurses.c:677)
==253393== Block was alloc'd at
==253393== at 0x133C477F: malloc (vg_replace_malloc.c:309)
==253393== by 0x113938: MMalloc (util.h:20)
==253393== by 0x113938: create_l3host (netaddrs.c:109)
==253393== by 0x113938: lookup_l3host_common (netaddrs.c:327)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1132BD: tx_mdns_ptr (mdns.c:130)
==253393== by 0x11BF3C: queue_for_naming (resolv.c:94)
==253393== by 0x113AD9: lookup_l3host_common (netaddrs.c:364)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1209EB: add_route6 (interface.c:365)
==253393== by 0x117FD9: handle_rtm_newaddr (netlink.c:389)
==253393== by 0x117FD9: handle_netlink_event (netlink.c:1077)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393== by 0x10F566: main (ncurses.c:677)
==253393==
==253393== Invalid read of size 1
==253393== at 0x133CAEB9: bcmp (vg_replace_strmem.c:1113)
==253393== by 0x113F09: lookup_global_l3host (netaddrs.c:399)
==253393== by 0x11C133: offer_wresolution (resolv.c:159)
==253393== by 0x11F104: handle_dns_packet (dns.c:471)
==253393== by 0x11CD6A: handle_udp_packet (udp.c:44)
==253393== by 0x123311: handle_ring_packet (psocket.c:254)
==253393== by 0x117888: ring_packet_loop (netlink.c:434)
==253393== by 0x117888: psocket_thread (netlink.c:461)
==253393== by 0x117888: psocket_thread (netlink.c:449)
==253393== by 0x1358B4CE: start_thread (in /usr/lib/libpthread-2.30.so)
==253393== by 0x136A32D2: clone (in /usr/lib/libc-2.30.so)
==253393== Address 0x1be241fc is 12 bytes inside a block of size 152 free'd
==253393== at 0x133C59AB: free (vg_replace_malloc.c:540)
==253393== by 0x114369: cleanup_l3hosts (netaddrs.c:521)
==253393== by 0x120312: free_iface (interface.c:225)
==253393== by 0x11803E: handle_rtm_dellink (netlink.c:407)
==253393== by 0x11803E: handle_netlink_event (netlink.c:1067)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393== by 0x10F566: main (ncurses.c:677)
==253393== Block was alloc'd at
==253393== at 0x133C477F: malloc (vg_replace_malloc.c:309)
==253393== by 0x113938: MMalloc (util.h:20)
==253393== by 0x113938: create_l3host (netaddrs.c:109)
==253393== by 0x113938: lookup_l3host_common (netaddrs.c:327)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1132BD: tx_mdns_ptr (mdns.c:130)
==253393== by 0x11BF3C: queue_for_naming (resolv.c:94)
==253393== by 0x113AD9: lookup_l3host_common (netaddrs.c:364)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1209EB: add_route6 (interface.c:365)
==253393== by 0x117FD9: handle_rtm_newaddr (netlink.c:389)
==253393== by 0x117FD9: handle_netlink_event (netlink.c:1077)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393== by 0x10F566: main (ncurses.c:677)
==253393==
==253393== Invalid read of size 8
==253393== at 0x113EF0: lookup_global_l3host (netaddrs.c:398)
==253393== by 0x11C133: offer_wresolution (resolv.c:159)
==253393== by 0x11F104: handle_dns_packet (dns.c:471)
==253393== by 0x11CD6A: handle_udp_packet (udp.c:44)
==253393== by 0x123311: handle_ring_packet (psocket.c:254)
==253393== by 0x117888: ring_packet_loop (netlink.c:434)
==253393== by 0x117888: psocket_thread (netlink.c:461)
==253393== by 0x117888: psocket_thread (netlink.c:449)
==253393== by 0x1358B4CE: start_thread (in /usr/lib/libpthread-2.30.so)
==253393== by 0x136A32D2: clone (in /usr/lib/libc-2.30.so)
==253393== Address 0x1be24258 is 104 bytes inside a block of size 152 free'd
==253393== at 0x133C59AB: free (vg_replace_malloc.c:540)
==253393== by 0x114369: cleanup_l3hosts (netaddrs.c:521)
==253393== by 0x120312: free_iface (interface.c:225)
==253393== by 0x11803E: handle_rtm_dellink (netlink.c:407)
==253393== by 0x11803E: handle_netlink_event (netlink.c:1067)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393== by 0x10F566: main (ncurses.c:677)
==253393== Block was alloc'd at
==253393== at 0x133C477F: malloc (vg_replace_malloc.c:309)
==253393== by 0x113938: MMalloc (util.h:20)
==253393== by 0x113938: create_l3host (netaddrs.c:109)
==253393== by 0x113938: lookup_l3host_common (netaddrs.c:327)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1132BD: tx_mdns_ptr (mdns.c:130)
==253393== by 0x11BF3C: queue_for_naming (resolv.c:94)
==253393== by 0x113AD9: lookup_l3host_common (netaddrs.c:364)
==253393== by 0x114059: lookup_local_l3host (netaddrs.c:435)
==253393== by 0x1209EB: add_route6 (interface.c:365)
==253393== by 0x117FD9: handle_rtm_newaddr (netlink.c:389)
==253393== by 0x117FD9: handle_netlink_event (netlink.c:1077)
==253393== by 0x119712: netlink_thread (netlink.c:1161)
==253393== by 0x119712: handle_netlink_socket (netlink.c:1179)
==253393== by 0x116FE5: omphalos_init (omphalos.c:360)
==253393== by 0x10F566: main (ncurses.c:677)
==253393==
I just committed a9f3d12ac2ad3c96933b1480ece83f9ac727482f, which might fix this. We were improperly using the result of mbrtowcs() on the resolution path. This would result in an improperly-terminated hostname. The first one we see above is a properly-terminated talk.l.google.com, but it could be getting compared to something already in the structure. Just a thought. I've got omphalos running on a few machines now to see if we can still reproduce the problem. Fingers crossed!
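For reference, a minimal sketch of a conversion that always yields a terminated wide string. This is not the code from that commit: hostname_to_wide() is a hypothetical helper, and the standard mbsrtowcs() is used as a stand-in, since the thread doesn't show the call that was actually changed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

// Hypothetical helper (an assumption, not omphalos code): convert a
// multibyte hostname to a freshly allocated wide string that is always
// terminated. Returns NULL on an invalid sequence or allocation failure.
// The caller frees the result.
static wchar_t *hostname_to_wide(const char *name){
  assert(name != NULL);
  mbstate_t ps;
  memset(&ps, 0, sizeof(ps));
  const char *src = name;
  // With a NULL destination, mbsrtowcs() only counts the wide characters
  // needed, not including the terminator.
  size_t len = mbsrtowcs(NULL, &src, 0, &ps);
  if(len == (size_t)-1){
    return NULL; // invalid multibyte sequence in the current locale
  }
  wchar_t *w = malloc(sizeof(*w) * (len + 1));
  if(w == NULL){
    return NULL;
  }
  memset(&ps, 0, sizeof(ps));
  src = name;
  if(mbsrtowcs(w, &src, len + 1, &ps) == (size_t)-1){
    free(w);
    return NULL;
  }
  // The return value excludes the terminator, and no terminator is written
  // if the destination fills up first, so terminate explicitly.
  w[len] = L'\0';
  return w;
}
```

The point is only that the return value is a character count (or (size_t)-1 on error), never an index that is safe to use before checking it.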
Nope, still there :(.
I'm not quite certain we're seeing this anymore since the Notcurses switch. None of our stack traces show any real UI work, but I also don't seem able to reproduce it in 0.99.13-pre's notcurses build...hrm.
Overnight stress tests passed without problem...
Nope, we've still got it :/.
I saw this today as my laptop came out of suspend mode -- like, it started refreshing the display, and then BOOM went down. That could possibly point to a certain work level -- the neighbor tables are presumably being rapidly repopulated, packets are being transmitted, etc. This could be a false lead, though, so don't put too much stock in it.
If you look at the valgrind output above (from 2020-01-10), all the faulty accesses are into areas freed in free_iface(), coming down through handle_rtm_dellink(). That would definitely point right at an interface down event, which could be a suspend -- as suggested above -- or a VPN coming down, which would be needed to explain the non-suspend environments where we've seen this.
Yeah, that's really promising, especially given this note from ourselves:
// Browse the global list. Don't create the host if it doesn't exist. Since
// references are handed out without a lock held, we cannot destroy an l3host
// which is on the global list! This is fundamentally unsafe, really FIXME.
that actually looks like the only place where hosts/ifaces are destroyed. maybe just retain them? i'm gonna pull that and see what happens.
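For comparison with plain retention, here's a rough sketch of what handing out counted references under a lock could look like, so that teardown can unlink hosts without freeing memory another thread is still reading. Everything here is hypothetical: struct host, hosts_lock, and the host_*() functions are simplified stand-ins, not the omphalos structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical, simplified stand-in for struct l3host.
struct host {
  struct host *next;
  unsigned refs;         // protected by hosts_lock
  unsigned char addr[4]; // IPv4 only, for brevity
};

static pthread_mutex_t hosts_lock = PTHREAD_MUTEX_INITIALIZER;
static struct host *hosts; // global list; owns one reference per entry

// Create a host and link it. The list owns the initial reference.
static struct host *host_create(const void *addr){
  assert(addr != NULL);
  struct host *h = malloc(sizeof(*h));
  if(h == NULL){
    return NULL;
  }
  memcpy(h->addr, addr, sizeof(h->addr));
  h->refs = 1;
  pthread_mutex_lock(&hosts_lock);
  h->next = hosts;
  hosts = h;
  pthread_mutex_unlock(&hosts_lock);
  return h;
}

// Look up without creating. The reference is taken while the lock is held,
// so the host cannot be freed between the match and the return.
static struct host *host_lookup(const void *addr){
  struct host *h;
  pthread_mutex_lock(&hosts_lock);
  for(h = hosts ; h ; h = h->next){
    if(memcmp(h->addr, addr, sizeof(h->addr)) == 0){
      ++h->refs;
      break;
    }
  }
  pthread_mutex_unlock(&hosts_lock);
  return h;
}

// Drop one reference. Frees the host and returns 1 if it was the last one.
static int host_release(struct host *h){
  pthread_mutex_lock(&hosts_lock);
  unsigned left = --h->refs;
  pthread_mutex_unlock(&hosts_lock);
  if(left == 0){
    free(h);
    return 1;
  }
  return 0;
}

// Teardown: unlink every host and drop only the list's own reference.
// Hosts still referenced elsewhere stay valid until their holder releases.
static void hosts_cleanup(void){
  pthread_mutex_lock(&hosts_lock);
  struct host *h = hosts;
  hosts = NULL;
  pthread_mutex_unlock(&hosts_lock);
  while(h){
    struct host *next = h->next;
    host_release(h);
    h = next;
  }
}
```

The cost is that every caller of the lookup has to release what it was handed, which is a much bigger change than just not freeing on interface teardown.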
Yeah, that got it. OK, we now just need to figure out how to recycle L2/L3 hosts when a device is brought down. Though maybe we shouldn't, and just drive them all via a global LRU? Yeah, do that.
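A bare-bones sketch of the global LRU idea, assuming a fixed capacity and a doubly-linked list with the most recently used host at the head. The names (struct lru, struct lru_host, lru_add(), lru_touch()) are made up for illustration, and no locking is shown.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical node; a real l3host carries far more state.
struct lru_host {
  struct lru_host *prev, *next;
  unsigned id; // stand-in for the address key
};

struct lru {
  struct lru_host *head, *tail; // head = most recently used
  size_t count, capacity;
};

// Detach a host from the list, fixing up head/tail as needed.
static void lru_unlink(struct lru *l, struct lru_host *h){
  if(h->prev){
    h->prev->next = h->next;
  }else{
    l->head = h->next;
  }
  if(h->next){
    h->next->prev = h->prev;
  }else{
    l->tail = h->prev;
  }
  h->prev = h->next = NULL;
  --l->count;
}

// Link a detached host at the most-recently-used end.
static void lru_push_front(struct lru *l, struct lru_host *h){
  h->prev = NULL;
  h->next = l->head;
  if(l->head){
    l->head->prev = h;
  }else{
    l->tail = h;
  }
  l->head = h;
  ++l->count;
}

// Mark a host as just used (e.g. a packet to or from it was seen).
static void lru_touch(struct lru *l, struct lru_host *h){
  lru_unlink(l, h);
  lru_push_front(l, h);
}

// Add a new host, evicting the least recently used one if the list is
// full. Returns the new host, or NULL on allocation failure.
static struct lru_host *lru_add(struct lru *l, unsigned id){
  assert(l->capacity > 0);
  struct lru_host *h = malloc(sizeof(*h));
  if(h == NULL){
    return NULL;
  }
  h->id = id;
  if(l->count == l->capacity){
    struct lru_host *victim = l->tail;
    lru_unlink(l, victim);
    free(victim);
  }
  lru_push_front(l, h);
  return h;
}
```

Note that eviction still frees a host, so on its own this only moves the lifetime problem from interface teardown to eviction time. It would need to be paired with some way of knowing nobody else holds the victim.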
with the newest 0.99.9-pre candidate, I've verified that we've resolved the crash-on-pull-ups problem. Much more infrequently, I now see a different segfault when I just leave it up long enough. I'm working to get a stacktrace now.