faster-cpython / ideas


Make Linux perf profiling work well with the JIT #659

Closed mdboom closed 3 months ago

mdboom commented 4 months ago

After a first run of using Linux perf to profile pyperformance on a JIT-enabled build of CPython, it seems that JIT frames show up as distinct from everything else, which is great. We get a bunch of entries that look like this:

self,children,object,symbol
0.03,0.03,[JIT],3625870                                 [.] 0x00007fd0c4c67129

(I assume that's some sort of unique id, and a pointer to the code page??)

Normal C function frames, for example, look like this:

20.38,42.96,python,_PyEval_EvalFrameDefault

So we are actually able to "see" when JIT code is the innermost frame when perf collects a sample.

However, I think it's not quite working right. With sampling-based profiling, there is always a certain percentage of the overall time that is "unaccounted for", because there is a long tail of call sites that are reported as 0.00% but nonetheless take some non-zero time. As far as I can gather, Linux perf has a fixed precision of hundredths of a percent. Traditionally in our data, the unaccounted-for time was 1-2%. With a JIT run, however, 1.5% of the time is in these [JIT] frames, but the unaccounted-for time jumps to 12%. So I think what is happening is that we now have an extremely long tail of individual [JIT] functions that aren't actually getting accounted for, and the time spent in the JIT is actually much higher than 1.5%.
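The arithmetic behind this can be sketched with made-up numbers (the sample counts below are illustrative, not taken from the actual profile): a few hot C functions survive perf's two-decimal rounding, while thousands of tiny [JIT] entries each round down to 0.00% and vanish from the total.

```python
# Hypothetical sample counts illustrating the long-tail problem:
# two hot C functions plus 5,000 distinct [JIT] addresses.
total = 1_000_000
hot = {"_PyEval_EvalFrameDefault": 203_800, "gc_collect_main": 50_000}
# Each JIT address gets ~24 samples, i.e. 0.0024% apiece.
jit_tail = {f"0x{0x7fd0c4c67000 + i:x}": 24 for i in range(5000)}

def reported_pct(samples: int) -> float:
    """perf report prints percentages with two decimal places."""
    return round(100.0 * samples / total, 2)

reported = sum(reported_pct(s) for s in {**hot, **jit_tail}.values())
actual = 100.0 * (sum(hot.values()) + sum(jit_tail.values())) / total
print(f"actual: {actual:.2f}%  reported: {reported:.2f}%")
# The JIT tail is 12% of all samples, yet every one of its rows
# rounds to 0.00%, so all of it lands in "unaccounted for".
```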

I'm not sure how to solve this. Maybe there's some way to "coerce" all of the [JIT] entries into one. If not, I would hate to have to fork perf report to get better precision out and/or write my own perf.data parser. (Though maybe the perf.data format is simple enough -- I have no idea.)
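Coercing the entries together could also be done as a post-processing step. Here is a sketch that collapses every per-address [JIT] row into one aggregate row; the column layout (self,children,object,symbol) is assumed from the CSV snippet above, and the sample rows are invented for illustration.

```python
import csv
import io

# Illustrative CSV in the shape of the perf report output shown above.
raw = """\
self,children,object,symbol
20.38,42.96,python,_PyEval_EvalFrameDefault
0.03,0.03,[JIT],3625870 [.] 0x00007fd0c4c67129
0.02,0.02,[JIT],3625870 [.] 0x00007fd0c4c67200
"""

jit_self = 0.0
rows = []
for row in csv.DictReader(io.StringIO(raw)):
    if row["object"] == "[JIT]":
        # Fold every JIT address into a single running total.
        jit_self += float(row["self"])
    else:
        rows.append(row)
rows.append({"self": f"{jit_self:.2f}", "children": "",
             "object": "[JIT]", "symbol": "(all JIT code)"})
print(rows[-1])
```

This only helps with rows perf actually prints, of course; samples already lost to 0.00% rounding would still need the raw-period approach discussed below in the thread.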

Cc @brandtbucher

mdboom commented 4 months ago

It looks like you can get the "raw" period time from perf report using --show-total-period and then we can calculate the percentage ourselves to whatever precision we want.
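With raw periods in hand, the percentage math is trivial to redo at any precision. A minimal sketch (the symbol names and period counts are invented; the exact field layout of `perf report --show-total-period` output is not reproduced here):

```python
# Hypothetical raw period counts per symbol, as --show-total-period
# would expose them, summed however we like.
periods = {
    "_PyEval_EvalFrameDefault": 2_038_456_789,
    "[JIT] (all)": 1_200_111_222,
    "gc_collect_main": 500_000_000,
}
total = sum(periods.values())
for symbol, period in periods.items():
    # Six decimal places instead of perf's fixed two.
    print(f"{100.0 * period / total:10.6f}%  {symbol}")
```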

mdboom commented 4 months ago

This solution seems to work really well. Here are the results (just for the hexiom benchmark -- not overall, which still takes ~12 hours to compute):

[image: profiling results for the hexiom benchmark]

See this PR: https://github.com/faster-cpython/bench_runner/pull/148

brandtbucher commented 4 months ago

Awesome! I wonder if they're just parsing /proc/self/maps to find our JIT pages.
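If that guess is right, the mechanism would look something like this sketch: scan /proc/self/maps and pick out executable mappings with no backing file on disk, which is where JIT-emitted code typically lives. (Whether perf actually does this is the open question; this is just an illustration of the idea.)

```python
def executable_anon_mappings(maps_path="/proc/self/maps"):
    """Return (start, end, perms, path) for executable mappings not
    backed by a file on disk -- candidate JIT code pages."""
    found = []
    with open(maps_path) as f:
        for line in f:
            fields = line.split(maxsplit=5)
            addr_range, perms = fields[0], fields[1]
            path = fields[5].strip() if len(fields) > 5 else ""
            # Anonymous mappings have no path; pseudo-paths like
            # [vdso] don't start with "/".
            if "x" in perms and not path.startswith("/"):
                start, end = (int(a, 16) for a in addr_range.split("-"))
                found.append((start, end, perms, path))
    return found

for start, end, perms, path in executable_anon_mappings():
    print(f"{start:#x}-{end:#x} {perms} {path or '(anonymous)'}")
```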

markshannon commented 4 months ago

Very nice.

Does perf just assume mmapped executable pages are "JIT"? That's handy, if it does.