
simpleperf: off cpu profiling doesn't work with --post-unwind option #587

Closed: hardc0der closed this issue 6 years ago

hardc0der commented 6 years ago

Description

When performing off-cpu profiling with the --trace-offcpu option, if the --post-unwind option is also specified, the sched_switch tracepoint events don't seem to make it into the record file.

I am seeing this with the SimpleperfExamplePureJava app (https://android.googlesource.com/platform/system/extras/+/master/simpleperf/demo/SimpleperfExamplePureJava/).

This is what works:

```
python app_profiler.py -p com.example.simpleperf.simpleperfexamplepurejava -a .SleepActivity -r "-e task-clock:u --call-graph dwarf -f 99 --trace-offcpu --duration 5"
./inferno.sh -sc
```

This is what doesn't work:

```
python app_profiler.py -p com.example.simpleperf.simpleperfexamplepurejava -a .SleepActivity -r "-e task-clock:u --call-graph dwarf -f 99 --post-unwind --trace-offcpu --duration 5"
./inferno.sh -sc
```

I also ran `simpleperf dump` on the record files generated in both cases, and in the latter case I don't see any of the tracepoint events among the sample records.

Environment Details

NDK Version: Pkg.Revision = 16.0.4442984

yabincui commented 6 years ago

It is because in the second case, the sched:sched_switch records are skipped by simpleperf. When -e task-clock:u is used instead of -e task-clock, the user wants to omit kernel samples. However, I found that when -e cpu-cycles:u is used, the kernel still dumps some kernel samples, so I added logic in simpleperf to skip samples that have only kernel callchains. And when --post-unwind is used, simpleperf doesn't generate user callchains until after recording, which makes simpleperf wrongly skip all sched_switch records.

If you use "-e task-clock --post-unwind ..." instead of "-e task-clock:u --post-unwind ...", it should work again. But instead of fixing it, I prefer to remove the --post-unwind and --no-unwind options from the record cmd, because:

  1. simpleperf has unwound while recording by default for a long time, and it works well.
  2. More time is taken by copying the stack than by the unwinder, so writing stack data to file actually takes more time than unwinding it.
  3. post-unwind and no-unwind take more storage space.

Any different opinions?

hardc0der commented 6 years ago

Thanks for responding so quickly. A few thoughts:

  1. If you are skipping sample records that have only kernel callchains, why does it matter when the stack unwinding happens? I assume that when you skip such records they aren't processed or stored in the record file at all, so I'm not following how the timing of the unwinding makes a difference.
  2. I'm pretty sure I experimented before with including kernel samples when using --trace-offcpu in conjunction with --post-unwind, but I think I saw inconsistent results: the way you calculate the duration a task was off-cpu seemed incorrect. I'll try again to make sure. Also, including kernel samples while doing off-cpu profiling made the flame graph more confusing for the off-cpu samples, since it included the syscall and kernel stack.
  3. I'll have to do more analysis, but I usually use --post-unwind, and even then simpleperf seems to use quite a bit of CPU when sampling at even 497 Hz. So when profiling a heavy workload, I worry that unwinding during recording might disturb the workload more. Also, wouldn't post-unwinding be useful for doing the unwinding offline on the host, so you don't have to push the symbols to the target before recording?
yabincui commented 6 years ago
  1. Sorry, I didn't explain this in detail. The time at which the stack unwinding happens matters, because the recording workflow is roughly as below:

```
ProcessRecord(Record r) {
  if (!post_unwind) {
    UnwindRecord(r);  // this converts the user stack to user-space callchains
  }
  if (exclude_kernel_callchains && r.GeneratedFromKernelSpace() &&
      r.HasNoUserSpaceCallChain()) {
    return;  // skip the record
  }
  // otherwise, store the record in the recording file
}
```

With --post-unwind, UnwindRecord(r) is called only after recording, so according to the code above, records generated from kernel space (which have no user-space callchains yet) are all skipped. The source code is at https://android.googlesource.com/platform/system/extras/+/master/simpleperf/cmd_record.cpp#928, at git sha b9214820bb9348e54fcb5013ae7ca8cc1c793141.

  2. The code that calculates the duration of a sample with the --trace-offcpu option is at https://android.googlesource.com/platform/system/extras/+/master/simpleperf/report_lib_interface.cpp#266. Note that with --trace-offcpu, a sample's duration includes both on-cpu time and off-cpu time; it doesn't calculate pure off-cpu time, as described in https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/#record-both-on-cpu-time-and-off-cpu-time (see the sketch after this list). Calculating pure off-cpu time is still a task to be done.

  3. With --post-unwind and --no-unwind, simpleperf needs to store 64K of stack data on the device for each sample, and the memcpy itself can cost a lot of cpu (see the back-of-the-envelope numbers after this list). Instead, when unwinding while recording, simpleperf loads the debug information of the needed binaries only once, and the debug info is then cached in memory, so normally the cost is just executing the unwinding instructions. Using --call-graph dwarf can take a lot of cpu, but I think it is acceptable for the app to be slower while profiling, as long as it is still workable. Simpleperf reports how many records are lost at the end of profiling; we can use a higher frequency as long as the app is workable and simpleperf doesn't lose many records. Currently, the unwinder used by simpleperf can only unwind stacks for the arch it was compiled for, so simpleperf on an x86_64 host can't unwind a stack recorded on arm/arm64. And since the debug info of the java code and the system libraries is all on the device, I prefer to suggest that users put binaries with debug info (or mini debug info, as the system libraries have) on the device.
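On point 2, here is a simplified sketch of the duration idea (my illustration, not the actual code in report_lib_interface.cpp; the types and names are made up): a sample's duration is taken as the time until the next sample on the same thread, so a sched_switch sample followed by a long gap absorbs the off-cpu time.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct Sample {
  int32_t tid;         // thread the sample belongs to
  uint64_t time_ns;    // sample timestamp
  uint64_t period_ns;  // duration assigned below
};

// samples must be sorted by time_ns
void AssignPeriods(std::vector<Sample>& samples) {
  std::map<int32_t, Sample*> prev_per_thread;
  for (Sample& s : samples) {
    auto it = prev_per_thread.find(s.tid);
    if (it != prev_per_thread.end()) {
      // The previous sample on this thread lasts until the current one,
      // whether the thread was running or sleeping in between.
      it->second->period_ns = s.time_ns - it->second->time_ns;
    }
    prev_per_thread[s.tid] = &s;
  }
}
```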
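On point 3, some back-of-the-envelope numbers (assumed workload, not measurements) for the stack-copying cost of --post-unwind / --no-unwind:

```cpp
#include <cstdio>

int main() {
  const long long stack_bytes = 64 * 1024;  // stack dumped per sample
  const long long freq_hz = 4000;           // default sample frequency
  // 64 KB * 4000 samples/s = 250 MB copied and written to file per second.
  printf("%lld MB/s\n", stack_bytes * freq_hz / (1024 * 1024));
  return 0;
}
```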

hardc0der commented 6 years ago
  1. I ran some more experiments, and what you explained is definitely true for NDK 16: I cannot use --post-unwind if I specify only user events with something like -e cpu-cycles:u. I also happen to build off of AOSP master, and with that version of simpleperf, regardless of whether I chose user-only events or kernel and user events, and whether I used --post-unwind or not, ProcessRecord() was dropping the tracepoint samples because they contained only kernel callchains (I added some log statements in cmd_record.cpp to confirm).

  2. Got it, thanks.

  3. I agree about the memory-copy overhead, although I'm not sure how much smaller the unwound callchain ends up being versus the 64K stack sample, and I also worry about losing samples while records are still being unwound, even with larger mmap buffers, but I'm sure you've looked into this. With --call-graph dwarf, system-wide recording at anything over 99 Hz on a heavily utilized system seems to cause a lot of samples to be lost (30%+) for me. Now that you mention it, I remember reading in your documentation that the host can't unwind stacks for other architectures, so that does minimize the value of the no-unwind option.

yabincui commented 6 years ago
  1. In AOSP master, the situation was worse because of https://android-review.googlesource.com/c/platform/system/extras/+/527373. test.py TestExamplePureJavaTraceOffCpu catches this, but I didn't run it when uploading the patch. Thanks to your reminder, I have fixed it in https://android-review.googlesource.com/c/platform/system/extras/+/560680.

  2. I had the impression that memcpy takes more time from when I was optimizing the record cmd a long time ago. Currently I don't have data on the time used by unwinding vs the time used by memcpy, because simpleperf doesn't collect time statistics about unwinding. I plan to collect unwinding failure results and the time used by unwinding in the near future, so it will become clear. I am not surprised by the high sample-loss rate when doing system-wide dwarf call-graph recording. Currently I have no ideas other than adjusting parameters:
     a. sample at a lower frequency for a longer time;
     b. try --call-graph fp instead;
     c. use cpu-cycles:u instead of cpu-cycles if you are not interested in kernel code;
     d. decrease the dumped stack size, like --call-graph dwarf,8192.
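For example (a hypothetical combination of (a), (c) and (d), not a command from this thread), something like `simpleperf record -a -e cpu-cycles:u -f 99 --call-graph dwarf,8192 --duration 30` samples system-wide at a lower frequency with a smaller dumped stack size.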

hardc0der commented 6 years ago
  1. Indeed, you are correct; I confirmed this is fixed after pulling from master again.

Can you also clarify why the number of samples recorded using task-clock vs cpu-cycles differs for some threads? When I sample system-wide, for some of my tasks the number of samples recorded using both events is about the same, but a few tasks show up with drastically different numbers of samples depending on which event I use to record.

yabincui commented 6 years ago

When using the -f HZ option to set the sample frequency (or 4000 by default), simpleperf asks the kernel to adjust the sample period of an event (how many events generate one sample) to meet the requested sample frequency. So comparing the number of samples from task-clock vs cpu-cycles only shows how good the kernel is at controlling the sample frequency of different event types.

And as far as I know, the kernel is better at controlling the sample frequency of task-clock than of cpu-cycles. task-clock counts nanoseconds of running time, and that is easy to translate: sample_period = 10^9 / Hz. For cpu-cycles, the kernel has to translate Hz into how many cpu cycles should generate one sample, but on Android, the cpu frequencies change with the workload, so the kernel has to readjust the sample period whenever the cpu frequency changes, and that's harder.
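As a worked example of that translation (illustrative numbers only):

```cpp
#include <cstdio>

int main() {
  const long long ns_per_sec = 1000000000LL;
  const long long freq = 4000;  // -f 4000
  // task-clock counts running time in ns, so the period is a fixed translation:
  printf("task-clock period: %lld ns\n", ns_per_sec / freq);  // 250000 ns
  // cpu-cycles: the period depends on the current cpu frequency. If the cpu
  // drops from 2 GHz to 1 GHz, the kernel must halve the period to hold 4000 Hz:
  for (long long cpu_hz : {2000000000LL, 1000000000LL}) {
    printf("cpu-cycles period at %lld Hz: %lld cycles\n", cpu_hz, cpu_hz / freq);
  }
  return 0;
}
```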

So instead of the sample count, I recommend checking the total event count. To make the event counts generated by different passes comparable, we usually force the cpus online and fix the cpu frequencies, then use the simpleperf stat command. One example is https://android.googlesource.com/platform/system/extras/+/master/app-launcher/.
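For instance, something like `simpleperf stat -e task-clock,cpu-cycles -p <pid> --duration 10` (a hypothetical invocation, run after the cpus are forced online and their frequencies fixed) gives event counts that are comparable across passes.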

When recording, we usually use only one event type: either cpu-cycles, or task-clock, or instructions. They can all show where cpu resources are spent, but with somewhat different results because of their different implementations: cpu-cycles shows the consumption of cpu cycles, task-clock shows the consumption of cpu time, and instructions shows the consumption of instruction count. These days I lean toward task-clock, because it is easier to explain how much time we have spent in a function.

hardc0der commented 6 years ago

Thanks. Closing since you've addressed the issue for which this was created.