Closed hardc0der closed 6 years ago
It is because in the second case, the records of sched:sched_switch are skipped by simpleperf. When -e task-clock:u instead of -e task-clock is used, users want to omit kernel samples. However, I find that when -e cpu-cycles:u is used, the kernel still dumps some kernel samples. So I added the logic to skip samples only with kernel callchains in simpleperf. And when --post-unwind is used, simpleperf doesn't generate user callchains until after recording. This makes simpleperf wrongly skips all sched_switch records.
If you use "-e task-clock --post-unwind ..." instead of "-e task-clock:u --post-unwind ...", it should work again. Instead of fixing it, I prefer to remove --post-unwind and --no-unwind option in the record cmd. Because:
Any different opinions?
Thanks for responding so quickly. A few thoughts:
--trace-offcpu
in conjunction with --post-uwnind
but I think I saw inconsistent results in that the way you calculate the duration the task was off-cpu seemed incorrect but I'll try again to make sure. Also, when including kernel samples with doing off-cpu profiling it kind of made the flame graph more confusing when showing the off-cpu samples since it included the syscall and kernel stack.--post-unwind
and even then simpleperf
seems to utilize quite a bit of CPU when sampling at even 497 Hz so when profiling a heavy workload I worry that unwinding during the recording might disturb the workload more. Also, wouldn't the post-unwinding be useful to be able to do the unwinding offline on the host so you don't have to push the symbols to the target before recording?With --post-unwind, UnwindRecord(r) is called after recording. According to the code above, records generated from kernel space are all skipped. The source code is at https://android.googlesource.com/platform/system/extras/+/master/simpleperf/cmd_record.cpp#928, but with git sha: b9214820bb9348e54fcb5013ae7ca8cc1c793141
The code to calculate the duration of a sample with --trace-offcpu option is at https://android.googlesource.com/platform/system/extras/+/master/simpleperf/report_lib_interface.cpp#266. Note that with --trace-offcpu, it includes both on-cpu time and off-cpu, but it doesn't calculate pure off cpu time, as in https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/#record-both-on-cpu-time-and-off-cpu-time. Calculating pure off-cpu time is still a task to be done.
With --post-unwind and --no-unwind, simpleperf needs to store 64K stack data on device for each sample. The memcpy itself can cost a lot of cpu. Instead, when unwinding while recording, simpleperf loads the debug information of needed binaries only once, then the debug info is cached in memory. So normally the cost is just executing the unwinding instructions. Using --call-graph dwarf can take a lof of cpu. But I think it is acceptable that the app is slower when profiling, as long as it is still workable. Simpleperf reports how many records are lost at the end of profiling. I think we can use higher frequency as far as the app is workable and simpleperf doesn't lost much records. Currently, the unwinder used by simpleperf can only unwind in the same arch as compiled. So the simpleperf on x86_64 host can't unwind a stack recorded in arm/arm64. And the debug info of java code and system libraries are all on device, so I prefer to suggest users put binaries with debug info (or mini debug info as system libraries) on device.
I ran some more experiments and what you explained is definitely true for NDK 16 in that I can not use --post-unwind
if I specify only user events with something like -e cpu-cycles:u
.
I also happen to build off of AOSP master and with this version of simpleperf
what I was seeing then is that regardless of whether I choose user events only or kernel and user events and --post-unwind
or not, ProcessRecord()
was dropping the tracepoint samples because there were only kernel stack callchains (I added some log statements in cmd_record.cpp
to confirm.
Got it, thanks.
I agree about the memory copy overhead although I'm not sure how much smaller the unwound callchain ends up being versus the 64K stack sample and I also worry about losing samples while you're still unwind records even with larger mmap buffers but I'm sure you've looked into this.
With --call-graph dwarf
doing systemwide recording at anything over 99 Hz on a heavily utilized system seems to cause a lot of samples to be lost (30%+) for me.
Now that you mention it I remember reading somewhere in your documentation that the host can't unwind stacks for cross architectures so that does minimize the value of the no unwinding option.
In the AOSP master, the situation was worse because of https://android-review.googlesource.com/c/platform/system/extras/+/527373. test.py TestExamplePureJavaTraceOffCpu catches this, but I didn't run it when uploading the patch. And thanks to your reminding, I have fixed it in https://android-review.googlesource.com/c/platform/system/extras/+/560680.
I had the impression that memcpy takes more time when I was optimizing the record cmd long time ago. Currently I don't have data about the time used by unwinding vs timed used by memcpy. Because simpleperf doesn't collect time statistics about unwinding. I plan to collect unwinding failure results and time used by unwinding in the near future. So it will be clear. I am not surprised about the high sample lost rate when doing system wide dwarf call graph recording. Currently I have no idea except by adjusting parameters: a. sample at a lower frequency for a longer time. b. try --call-graph fp instead. c. use cpu-cycles:u instead of cpu-cycles if you are not interested in kernel code. d. decrease dumped stack size, like --call-graph dwarf,8192.
Can you also clarify why the number of samples recorded using task-clock
vs cpu-cycles
is different for some threads? If I sample system-wide there are some of my tasks where the number of samples recorded using both events is about the same but there are a few tasks that show up with drastically different numbers of samples depending on which event I use to record.
When using -f HZ option to set sample frequency (or by default using -f 4000), simpleperf asks the kernel to adjust the sample period of a event (how many events to generate one sample) to fulfill the sample frequency need. So comparing the number of samples using task-clock vs cpu-cycles only shows how good the kernel is at controlling the sample frequency of different event types.
And as I know, the kernel is better at controlling the sample frequency of task-clock than cpu-cycles. Because task-clock uses nanoseconds running, and that is easy to translate, sample_period = 10^9 / Hz. For cpu-cycles, the kernel tries to translate Hz to how many cpu cycles to generate one sample. But on Android, the cpu frequencies change with workload. So the kernel has to adjust the sample period when cpu frequency changes, and that's harder.
So instead of sample count, I recommend checking total event count. To make event count generated by different passes comparable, we usually force cpu online, fix cpu frequency, then use simpleperf stat command. One example is https://android.googlesource.com/platform/system/extras/+/master/app-launcher/.
When recording, we usually only use one event type, either cpu-cycles, or task-clock, or instructions. They all can show where the cpu is spent at, but have somewhat different results because of different implementations. cpu-cycles shows the consumption of cpu cycles, task-clock shows the consumption of cpu time, and instructions shows the consumption of instruction count. Now I prefer more in using task-clock, because it is easier to explain how much time we have spent in one function.
Thanks. Closing since you've addressed the issue for which this was created.
Description
When performing off-cpu profiling with the
--trace-offcpu
option if the--post-unwind
option is also specified then it seems that thesched_switch
tracepoint events aren't making it into the record file.I am seeing this with the SimpleperfExamplePureJava app (https://android.googlesource.com/platform/system/extras/+/master/simpleperf/demo/SimpleperfExamplePureJava/).
This is what works:
python app_profiler.py -p com.example.simpleperf.simpleperfexamplepurejava -a .SleepActivity -r "-e task-clock:u --call-graph dwarf -f 99 --trace-offcpu --duration 5
./inferno.sh -sc
This is what doesn't work:
python app_profiler.py -p com.example.simpleperf.simpleperfexamplepurejava -a .SleepActivity -r "-e task-clock:u --call-graph dwarf -f 99 --post-unwind --trace-offcpu --duration 5
./inferno.sh -sc
I also used
simpleperf dump
on the record file generated in both cases and in the latter case I don't see any of the tracepoint events in the sample records.Environment Details
NDK Version: Pkg.Revision = 16.0.4442984