criblio / appscope

Gain observability into any Linux command or application with no code modification
https://appscope.dev
Apache License 2.0
268 stars 33 forks source link

Fix memory leaks when the circular buffer is full #1521

Closed jrcheli closed 1 year ago

jrcheli commented 1 year ago

When scoping nginx, and we don't have a connection to send events out, the libraries' circular buffer will eventually fill up. When this happens, we free events but not the event data they contain. Hitting nginx with "ab" is an easy way to put the library in this stressed position where we leak memory.

jrcheli commented 1 year ago

1)

If the event circular buffer fills up, then we're not fully freeing the events that are getting dropped on the datapath side. I have a fix for this on the bug/1521-mem-leaks branch.

2)

On the reporting side, we A) read from the event circular buffer, and B) generate both events and metrics, and C) send both events and metrics. Right or wrong, I expected that things would stay on the event circular buffer until an event destination is connected, but this isn't true. If events and metrics are routed to different locations and the metric destination is configured to be udp, we'll read from the event circular buffer for the sake of metrics. From my perspective our behavior implies that we've decided that metric data is more important than event data. Right or wrong, I feel like the opposite is true. I don't have a suggestion here, except to double check that we want the behavior the way we have it now.

3)

We have one thread reading off of the event circular buffer on the reporting side. If A) the datapath tries to put things on this queue faster than the reporting side can consume them, and B) the event circular buffer fills up then we find ourselves in an ugly state, particularly w.r.t. http. Why http? It's unique in that we try to keep some state from an http request that is only cleaned up when we marry it to it's paired http response. (g_maplist)

Say we have a sequence events: http request A, http response A, http request B, http response B. If the event circular buffer was full and the reporting thread wasn't actively trying to read, all four of these would end up on the floor. Ideally we wouldn't lose data, but this scenario isn't the ugly one. Since we lost both each request and it's paired response, we won't accumulate cruft on the reporting side.

Where it gets ugly is when the circular buffer is full, but the reporting thread consumes events so that there are gaps when the datapath intermittently can add one event to the event buffer but not the next. In this state the event circular buffer can end up with http requests without their corresponding responses, and responses without their corresponding requests. When this happens we'll accumulate cruft. Stuff allocated and added to the g_maplist structure and never freed, causing a second kind of leak. Possible improvements that can fix half of the problem for http1... on the datapath side if we can't add an http request to the circular buffer, don't even attempt to add the associated http response. Somewhat redundant with this, on the reporting side, don't create a map object for http responses, only requests

jrcheli commented 1 year ago

With the first commit, made sure we're doing the same free logic when the event circular buffer is full on the datapath side as when we're processing events on the reporting side. Testing for this:

cd test/integration
make nginx-shell
unset SCOPE_EVENT_DEST SCOPE_METRIC_DEST
export SCOPE_EVENT_LOGFILE=false SCOPE_EVENT_CONSOLE=false SCOPE_EVENT_METRIC=false
apt update && apt install -y apache2-utils socat
scope -z nginx && top
In another terminal in the container:
ab -n 1000000 -c 3 http://localhost:80/

using top, observed peak RSS for nginx go from 8.5 G (before this commit) to .5G (after this commit) 

With the second commit, updated Makefille so unit tests can build successfully. With the third commit, allowed http.resp events to be created even if a map object can not be found. Also moved a couple of potentially large allocations from the stack to heap. Testing for this:

o) First, reran the testing for the first commit to verify that this commit does not have an effect on memory usage.  (no worse)

o) Second, reran the testing for the first commit again, but running this socat command in a third terminal into the container before starting nginx so the contents of the data could be verified.  `socat - TCP-LISTEN:9109,reuseaddr,fork | grep http.resp`.  What I was looking for is that some some http.resp events contain "target", while others do not.  ("target" is from the request).

o) Third, to verify that http metrics (always aggregated) still work...  I reran the testing for the first commit, but added some environment variables to nginx, and ran with a differently filtered socat command so I could look at metric output.  I was looking for any metric output from socat.
export SCOPE_CRIBL_ENABLE=true
export SCOPE_CRIBL=tcp://localhost:9109
Using `socat - TCP-LISTEN:9109,reuseaddr,fork | grep http | grep "\"type\":\"metric\""`

With the fourth commit, added code to break out of the doEvent() loop so other processing can not get starved indefinitely. Also refined logic to make sure that we are not processing http protocol data 1) if SCOPE_EVENT_ENABLE=false and SCOPE_EVENT_HTTP=true or SCOPE_METRIC_ENABLE=false and SCOPE_METRIC_HTTP=true. (Basically the "master enable" for events and metrics was being ignored before. Now it will be honored. Testing for this:

cd test/integration
make nginx-shell
export SCOPE_EVENT_LOGFILE=false SCOPE_EVENT_CONSOLE=false SCOPE_EVENT_METRIC=false
export SCOPE_CRIBL_ENABLE=true
export SCOPE_CRIBL=tcp://localhost:9109
apt update && apt install -y socat
scope -z top -d 0.01

In a second terminal in the container, after top's RES memory seems to stop climbing, 
socat - TCP-LISTEN:9109,reuseaddr,fork 

Then in a third terminal into the container, as soon as socat starts receiving data, repeatedly run:
scope inspect `pidof top`

The thing we're trying to observe is that the scope inspect command continues to work without too many timeouts.  I think some timeouts are acceptable, but ideally we wouldn't see many of them.
jrcheli commented 1 year ago

Just to document the final state of https://github.com/criblio/appscope/issues/1521#issuecomment-1591975835 above...

1) Addressed by PR https://github.com/criblio/appscope/pull/1551 2) No action needed. It's really just an open ended question that doesn't require any change at this time. 3) Will be addressed by https://github.com/criblio/appscope/issues/1557