Open disprosium8 opened 8 years ago
Hmmmm. We've spent a bunch of time on the packaging for github.com/accre/lstore getting -dbginfo builds. I'll put porting those changes to the ibp server on that list.
Is it reproducible at all?
It's dark in this basement.
Haven't been able to determine a trigger; every now and then we get a few depots that crash and require a restart. I'm getting a debug build going on ibp2 to see if we can narrow down the issue.
After some address sanitizer builds, I realized the unis registration struct was never correctly initialized, leading to memory corruption. Fixed in 26e060ebb82851fdddd7d4e61650fc5a7bd76cf1 and hoping that was the root cause.
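For anyone hitting the same class of bug, here is a minimal sketch of the usual fix: zero-initialize the struct before any field is read. The struct `reg_config` and the function `reg_config_init` are hypothetical stand-ins for illustration, not the actual unis registration struct or the change made in the commit above.

```c
#include <stddef.h>

/* Hypothetical registration struct; the real unis struct in ibp_server
 * has different fields. */
struct reg_config {
    char *url;      /* endpoint string, owned by the struct */
    int   port;     /* TCP port */
    int   enabled;  /* nonzero when registration is turned on */
};

/* Put every field in a known state. Assigning a zeroed compound literal
 * sets pointers to NULL and integers to 0, so later code never reads
 * stack or heap garbage. */
void reg_config_init(struct reg_config *cfg)
{
    *cfg = (struct reg_config){0};
}
```

Calling an init function like this right after declaring or allocating the struct means a later `if (cfg.url != NULL)` check or a `free(cfg.url)` acts on a defined value rather than whatever happened to be in memory.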
So it looks like ibp_server is just overcommitting way too much memory, and eventually enough pages get touched that the process can no longer allocate, and malloc fails. This leads to crashes in various places where the return from malloc is not checked, e.g. this strdup https://github.com/datalogistics/ibp_server/blob/master/subnet.c#L59
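As a rough sketch of what checking those returns could look like, the helper below copies a string and reports allocation failure to the caller instead of handing back a pointer that may be NULL. The name `dup_string` and its return convention are assumptions for illustration; nothing like it exists in ibp_server as far as this thread shows.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate src into *out.
 * Returns 0 on success, -1 on bad arguments or allocation failure. */
int dup_string(const char *src, char **out)
{
    size_t len;
    char *copy;

    if (src == NULL || out == NULL) {
        return -1;
    }

    len = strlen(src) + 1;          /* include the terminating NUL */
    copy = malloc(len);
    if (copy == NULL) {
        /* Out of memory: tell the caller rather than crash later. */
        fprintf(stderr, "dup_string: allocation of %zu bytes failed\n", len);
        return -1;
    }

    memcpy(copy, src, len);
    *out = copy;
    return 0;
}
```

A caller would then do something like `if (dup_string(text, &copy) != 0) { /* log and bail out */ }`, which turns an out-of-memory condition into a handled error instead of a NULL dereference somewhere downstream. This only makes the failure visible; it does not address the memory growth itself.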
The process on ibp2 just crashed again and it was at 260GB+ committed virtual memory according to our host monitoring. Are BDB allocations not getting unmapped or something?
Uh, 260GB is way too many GB. There's a slow leak that we've known about, but it's not been nearly that bad for our particular use pattern. Let me produce an ASAN build and get @tacketar to run it on one of the big CMS depots to see if I can't track down the heavy hitters for this leak.
Finally got a core dump from some recent crashes of a server. No clue as to what string (some IP address) is causing the issue, will need to build a debug version to see if it's possible to recreate.