memory leak on status update

trws commented 4 months ago

The json_t * variables R_all, R_down and R_alloc in status_request_cb are only freed by json_decref on the error path. The caches in the ctx (mr{all,down,alloc}) are never freed at all. Found this thanks to a 10k leak heaptrack found on one of our drained resource tests, tiny in the test, but could get larger over time if there are a lot of requests to this RPC, or a lot of resource updates, or especially both.

Plan:

add a lifecycle management type for json_t so this can't happen by accident anymore, either a smart pointer wrapper with json_decref as the free function or an actual functional wrapper like the janssoncpp headers provide
add the same for flux-core with __attribute__((cleanup)) so we can be safer about all of this

This is a priority bugfix, if I can get it together today I will, and we should try to get it deployed ASAP.

milroy commented 4 months ago

I don't think the sched-fluxion-resource.status RPC is used anymore, since I believe all the functionality is handled by core now.

The caches in the ctx (mr{all,down,alloc}) are never freed at all.

The member data is updated in status_request_cb whenever the values change or a configured amount of time has elapsed, but yeah, they aren't freed when Fluxion is stopped.

trws commented 4 months ago

Mark mentioned the same thing, and I agree it shouldn't be (at least not often) but it shows up in my recent heaptrack leak report. Will check again, try to get more detail here.

garlick commented 4 months ago

It's not used by default anymore, but we can check the scheduler's view of resources by running e.g.

FLUX_RESOURCE_LIST_RPC=sched.resource-status flux resource list

which can be handy. I'd say don't get rid of it :-)

trws commented 4 months ago

Yup, not planning to get rid of it, do think we need to fix the leak though, and rather wondering what's poking it.

flux-framework / flux-sched

memory leak on status update #1245