datalogistics / ibp_server


seg fault in string parsing #30

Open disprosium8 opened 8 years ago

disprosium8 commented 8 years ago

Finally got a core dump from some recent crashes of a server. No clue which string (some IP address) is causing the issue; I'll need to build a debug version to see if it's possible to recreate.

Program terminated with signal 11, Segmentation fault.
#0  0x00007f9ca85be46b in strtok_r () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install accre-ibp-server-2.0-1.el7.x86_64
(gdb) bt
#0  0x00007f9ca85be46b in strtok_r () from /lib64/libc.so.6
#1  0x000000000043f2ab in string_token ()
#2  0x000000000042ec27 in ipdecstr2address ()
#3  0x0000000000417928 in worker_task ()
#4  0x00007f9ca9e04dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f9ca862c28d in clone () from /lib64/libc.so.6
PerilousApricot commented 8 years ago

Hmmmm. We've spent a bunch of time on the packaging for github.com/accre/lstore getting -dbginfo builds. I'll put porting those changes to the ibp server on that list.

Is it reproducible at all?


disprosium8 commented 8 years ago

Haven't been able to determine a trigger; every now and then a few depots crash and require a restart. I'm getting a debug build going on ibp2 to see if we can narrow down the issue.

disprosium8 commented 8 years ago

After some address sanitizer builds, I realized the unis registration struct was never correctly initialized, leading to memory corruption. Fixed in 26e060ebb82851fdddd7d4e61650fc5a7bd76cf1 and hoping that was the root cause.
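A rough sketch of the class of bug described here, with illustrative names (the actual unis registration struct is in the ibp_server source, not reproduced here): a struct allocated with plain malloc() holds whatever bytes were previously on the heap, so any pointer fields inside it are garbage until explicitly initialized, and using them corrupts memory in ways that crash far from the real fault.

```c
#include <stdlib.h>

/* Illustrative struct, not the real unis registration layout. */
typedef struct {
    char *endpoint;   /* garbage pointer if the struct is never zeroed */
    int   interval;
} unis_reg_t;

/* Buggy pattern: fields are uninitialized heap bytes. */
unis_reg_t *reg_new_bad(void)
{
    return malloc(sizeof(unis_reg_t));
}

/* Fixed pattern: calloc zeroes every field, so endpoint starts NULL. */
unis_reg_t *reg_new_good(void)
{
    return calloc(1, sizeof(unis_reg_t));
}
```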

disprosium8 commented 8 years ago

So it looks like ibp_server is just overcommitting way too much memory; eventually enough pages get touched that the process can no longer allocate, and malloc fails. This leads to crashes in various places where the return from malloc is not checked, e.g. this strdup: https://github.com/datalogistics/ibp_server/blob/master/subnet.c#L59

The process on ibp2 just crashed again and it was at 260GB+ committed virtual memory according to our host monitoring. Are BDB allocations not getting unmapped or something?

PerilousApricot commented 8 years ago

Uh, 260GB is way too many GB. There's a slow leak that we've known about, but it's not been nearly that bad for our particular use pattern. Let me produce an ASAN build and get @tacketar to run it on one of the big CMS depots to see if I can't trace down the heavy-hitters for this leak.