flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

resource and job-manager bottlenecked on rlist and idset performance for status requests #6135

Open trws opened 1 month ago

trws commented 1 month ago

I pulled a perf profile this morning on our largest deployment, mainly to see if anything stood out when it isn't on fire. It seems like no jobs are currently being considered (the 9 pending jobs must be blocked), but there are definitely users asking for status. Everything seems healthy, yet a flux resource list takes ~3.5 seconds, and the traces in the job-manager and resource modules are particularly surprising:

In the call tree view, resource_status_cb in job-manager and prepare_sched_status_payload in resource appear as literally 100% of non-epoll-waiting samples for 5-10 of the 15 seconds of the trace. A single call to resource_status_cb is shown as taking 4.2 seconds in job-manager, 2.4 seconds of which is rlist_destroy of all things (though that may be partly due to malloc consolidation in glibc).

On the resource side, time is dominated by rlist_from_json and rlist_to_R. Perf attributes this to get_empty_set, but that makes no sense; presumably it's actually get_all or get_down. Load was low when I took this, and almost all of that time seems to come from rlist_compressed->rlist_mrlist->zlistx_find.

Anyway, the largest traces land on idset ops and rlist ops of one kind or another. Happy to share this trace if anyone wants it.

garlick commented 1 month ago

Ooh interesting.

Well, #6105 just went in last week to hopefully avoid some of this.

trws commented 1 month ago

Excellent, that should definitely help! I'll try to take a trace on the new version and see where we end up. The rlist_to_R parts might still be meaningful if I'm reading the new code correctly, but it's always better to measure.

grondo commented 1 month ago

To get the correct time for a status/list response, you should check the low-level RPC call. Probably half the time is spent in Python assembling the results, though we vastly improved that a while ago.
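For illustration, a minimal sketch of timing the raw RPC round trip from C, without the Python CLI in the way. The topic string below is a placeholder, not confirmed to be what flux resource list actually sends; substitute the real topic:

```c
/* Sketch: time a single status RPC directly, bypassing the Python CLI.
 * "resource.sched-status" is a placeholder topic -- substitute whatever
 * topic flux resource list actually queries.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <flux/core.h>

static double monotime (void)
{
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main (void)
{
    flux_t *h;
    flux_future_t *f;
    const char *payload = NULL;
    double t0;

    if (!(h = flux_open (NULL, 0)))
        return 1;
    t0 = monotime ();
    if (!(f = flux_rpc (h, "resource.sched-status", NULL, FLUX_NODEID_ANY, 0))
        || flux_rpc_get (f, &payload) < 0) {
        perror ("rpc");
        return 1;
    }
    printf ("RPC round trip: %.3fs (%zu byte payload)\n",
            monotime () - t0, payload ? strlen (payload) : 0);
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}
```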

grondo commented 1 month ago

Though the rlist_mrlist->zlistx_find seems like very low-hanging fruit.

trws commented 1 month ago

That's a good point @grondo, I'll see what I can get with a short test, give it 20 minutes or something. As to the low-hanging fruit: if it's reasonable to use a zhashx there instead of a zlist, that should be a huge improvement all by itself for the cases where we need to do that op.
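Roughly, the shape of that change could look like the sketch below. This is not the real rlist internals: the entry struct and the string key are invented for illustration. The point is just to trade the O(n) zlistx_find scan for an O(1) zhashx lookup while keeping the zlistx for ordered iteration:

```c
/* Sketch only: key a zhashx on whatever the zlistx_find comparator matches
 * (a made-up string key here), keeping the zlistx for ordered output.
 */
#include <czmq.h>

struct entry {
    char *key;           /* hypothetical: whatever the comparator matched on */
    /* ... rest of the mrlist entry ... */
};

struct index {
    zlistx_t *ordered;   /* preserves output order, as today */
    zhashx_t *by_key;    /* O(1) lookup instead of zlistx_find() */
};

static struct entry *index_lookup (struct index *idx, const char *key)
{
    return zhashx_lookup (idx->by_key, key);
}

static int index_add (struct index *idx, struct entry *e)
{
    if (!zlistx_add_end (idx->ordered, e))
        return -1;
    return zhashx_insert (idx->by_key, e->key, e);
}

static struct index *index_create (void)
{
    struct index *idx = calloc (1, sizeof (*idx));
    if (!idx)
        return NULL;
    idx->ordered = zlistx_new ();
    idx->by_key = zhashx_new ();   /* default key handling duplicates strings */
    if (!idx->ordered || !idx->by_key) {
        zlistx_destroy (&idx->ordered);
        zhashx_destroy (&idx->by_key);
        free (idx);
        return NULL;
    }
    return idx;
}
```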

grondo commented 1 month ago

I think I looked at this before and decided it wasn't worth it at the time because other issues were 2-10x the cost. However, definitely take another look.

trws commented 1 month ago

The updated trace is quite different. This is a fake workload on my workstation where I start 10 workers that each run flux resource list 20 times sequentially, so 200 requests total, against a 16,000-node cluster with all even-numbered nodes drained (can't let it have big contiguous ranges). There are no resource updates during the test, so it's as cache-friendly as possible.

The resource module now spends the vast majority of its time, 70%, in rlist_json_properties, with 57% in rlist_properties.

The job-manager module is now spending 20% in rlist_to_R and 43% in flux_respond_pack. It looks like a big chunk of that, 27% of the thread's overall time, is json_dumps. Maybe it's worth pre-dumping the actual R and keeping the string around, if that isn't hard?
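A rough sketch of that idea, with the understanding that the cache struct and the invalidation call site are hypothetical rather than existing job-manager code: encode R once with jansson's json_dumps() and answer subsequent requests with flux_respond() and the pre-encoded string.

```c
/* Sketch: encode R once and reuse the string for later status responses,
 * invalidating when the resource set changes.  The struct and the call
 * sites are hypothetical, not existing job-manager code.
 */
#include <errno.h>
#include <stdlib.h>
#include <jansson.h>
#include <flux/core.h>

struct rcache {
    json_t *R;        /* current resource set object */
    char *R_str;      /* lazily encoded copy of R */
};

/* Call whenever R is replaced/modified so the next response re-encodes it. */
static void rcache_invalidate (struct rcache *c)
{
    free (c->R_str);
    c->R_str = NULL;
}

static const char *rcache_encode (struct rcache *c)
{
    if (!c->R_str)
        c->R_str = json_dumps (c->R, JSON_COMPACT);
    return c->R_str;
}

/* Respond with the pre-encoded payload instead of re-packing R each time. */
static int respond_status (flux_t *h, const flux_msg_t *msg, struct rcache *c)
{
    const char *payload = rcache_encode (c);
    if (!payload)
        return flux_respond_error (h, msg, ENOMEM, NULL);
    return flux_respond (h, msg, payload);
}
```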

Since this isn't sensitive and it's small when compressed, here's the trace: current_core_noranges.perf.gz

Firefox profiler or flamegraph or even speedscope should be able to handle the format.