android / ndk

The Android Native Development Kit

[BUG] Zero-sized output when using simpleperf to convert profile to AutoFDO format #1564

Closed Over17 closed 3 years ago

Over17 commented 3 years ago

Description

Using simpleperf to convert perf.data into AutoFDO format produces a zero-sized output file. (I need it for PGO experiments.)

I tried running simpleperf both on the host (Windows) and on the device.

Command line used:

To collect the profile: python app_profiler.py -p com.unity3d.torture

To convert on the host: c:\android-ndk-r23\simpleperf>bin\windows\x86_64\simpleperf.exe inject -i perf.data -o autofdo.txt --output autofdo

Command line on the device used:


c:\android-ndk-r23\simpleperf>adb pull /data/local/tmp/autofdo.txt
/data/local/tmp/autofdo.txt: 1 file pulled.

c:\android-ndk-r23\simpleperf>adb shell
o1s:/ $ ls -lA /data/local/tmp
total 11924
-rw-rw-rw- 1 shell shell       0 2021-08-18 09:24 autofdo.txt
-rw-r--r-- 1 shell shell 8470111 2021-08-18 09:22 perf.data
-rwxrwxrwx 1 shell shell 3726800 2021-07-31 01:03 simpleperf

Attaching perf.data just in case. perf.zip

This looks weird, because the code is definitely there: https://android.googlesource.com/platform/system/extras/+/master/simpleperf/cmd_inject.cpp#102

The documentation at https://source.android.com/devices/tech/perf/pgo#collecting-profiles says "At this time, Android does not support using sampling-based profile collection", but that makes no sense since I can collect sample-based profiles using simpleperf, huh?

enh-google commented 3 years ago

assigning to yabinc as "mr simpleperf", but also adding the PGO folks since i don't know enough about this stuff to know who's best suited to look at this, and it might end up involving everyone anyway... :-)

yabinc commented 3 years ago

The documentation at https://source.android.com/devices/tech/perf/pgo#collecting-profiles says "At this time, Android does not support using sampling-based profile collection", but that makes no sense since I can collect sample-based profiles using simpleperf, huh?

simpleperf can record samples, but not all samples can be used for PGO. To be useful for PGO, the samples need to carry branch information, so the compiler knows which branch directions are more likely to be taken and are worth optimizing. Intel x86 supports this through LBR (Last Branch Record), which can be recorded with the -b option of the record command. ARM supports this through CoreSight ETM, which can be recorded with the -e cs-etm option of the record command. simpleperf inject only supports perf.data files generated with the -e cs-etm option, so a profile recorded without it yields no AutoFDO output.

For security reasons, ETM won't be available on user devices anytime soon. Here is more info: https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/collect_etm_data_for_autofdo.md. So currently the only way to do app PGO is instrumentation-based PGO. Here is another doc for it: https://medium.com/androiddevelopers/pgo-for-native-android-applications-1a48a99e95d0.

Over17 commented 3 years ago

Thank you @enh-google and @yabinc.

Do you know if Arm ETM will be supported in Armv9 devices - is the extension going to be mandatory then?

The article by androiddevelopers is super useful. I tried building an instrumented build but didn't seem to get any traces written, which may be explained by the lack of a __llvm_profile_write_file() call.

I need to add -fprofile-generate to my compiler and linker invocations, but will it work if I add it to only one of the .so's in the APK? The docs in https://source.android.com/devices/tech/perf/pgo#enabling-pgo-in-android-bp-files are a bit unclear, or at least I'm having a hard time deciphering them:

Static libraries instrumented with PGO, all shared libraries, and any binary that directly depends on the static library must also be instrumented for PGO. However, such shared libraries or executables don't need to use PGO profiles, and their enable_profile_use property can be set to false. Outside of this restriction, you can apply PGO to any static library, shared library, or executable.

Or if I have multiple so's that I want to instrument, do I need to call __llvm_profile_write_file() from each of them? The function is likely defined in the static lib which is linked by the linker flag.

DanAlbert commented 3 years ago

(closing since it seems there's no bug to fix, but we're still here to answer questions)

stephenhines commented 3 years ago

Do you know if Arm ETM will be supported in Armv9 devices - is the extension going to be mandatory then?

The hardware is actually there on almost all existing ARM devices, but is fused off for security reasons on production devices. We're hoping that future devices won't have such limitations, but don't have anything else to share about that right now.

pirama-arumuga-nainar commented 3 years ago

The docs in https://source.android.com/devices/tech/perf/pgo#enabling-pgo-in-android-bp-files

These are docs for PGO on platform libraries and are not relevant for applications.

Or if I have multiple so's that I want to instrument, do I need to call __llvm_profile_write_file() from each of them? The function is likely defined in the static lib which is linked by the linker flag.

It should be called for each shared library. Each .so has a LOCAL/hidden copy of this function that writes profiles for that particular library. I think calling dlclose() on each library may also work.
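One way to arrange "one call per shared library" is to give every instrumented .so its own exported flush entry point that forwards to that library's hidden copy of the runtime function. The following is only a sketch under assumptions: the macro DEFINE_PGO_FLUSH and the name libengine are hypothetical and do not come from the NDK or LLVM; only __llvm_profile_write_file() is a real runtime function. It is declared weak here so the same source still links when the library is built without -fprofile-generate.

```c
/* Provided by the LLVM profile runtime in builds that use -fprofile-generate.
 * Declared weak so an uninstrumented build links and sees a null address. */
int __llvm_profile_write_file(void) __attribute__((weak));

/* Hypothetical helper macro: expands to an exported function <lib>_pgo_flush()
 * that must be compiled into the shared library it is named after, so that it
 * binds to that library's own hidden copy of the runtime.
 * Returns 0 on success, -1 if the runtime reported a write failure,
 * and 1 if this library was not built with instrumentation. */
#define DEFINE_PGO_FLUSH(lib)                                   \
    __attribute__((visibility("default")))                      \
    int lib##_pgo_flush(void) {                                 \
        if (!__llvm_profile_write_file) return 1;               \
        return __llvm_profile_write_file() == 0 ? 0 : -1;       \
    }

/* "libengine" is a placeholder; use one distinct name per instrumented .so. */
DEFINE_PGO_FLUSH(libengine)
```

The application would then call libengine_pgo_flush(), and the equivalent function of every other instrumented library, at a point where the profiled workload has finished.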

DanAlbert commented 3 years ago

I think calling dlclose() on each library may also work.

C++ effectively requires that dlclose does nothing for a lot of programs. If you use this trick, expect it to stop working in the future.

Over17 commented 3 years ago

Calling __llvm_profile_write_file() returns -1, and there is no error printed anywhere. Do I maybe need to call __llvm_profile_set_filename() earlier, or something like that? I double-checked that -fprofile-generate is passed to the compiler and the linker of one of the libraries (and its size increased by ~20 MB).

Over17 commented 3 years ago

Strike that - calling __llvm_profile_set_filename() on a writeable path seems to have worked!
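For readers who hit the same -1, the fix described here can be sketched roughly as below. This is an illustration under assumptions, not the exact code used in this thread: the helper name write_profile_to is hypothetical, while __llvm_profile_set_filename() and __llvm_profile_write_file() are the LLVM profile runtime functions discussed above. They are declared weak so the snippet also links in an uninstrumented build.

```c
#include <stddef.h>

/* LLVM profile runtime entry points, present when the library is built with
 * -fprofile-generate. Weak, so their addresses are null otherwise. */
void __llvm_profile_set_filename(const char *name) __attribute__((weak));
int __llvm_profile_write_file(void) __attribute__((weak));

/* Hypothetical helper: point the runtime at a writable path, then flush.
 * Returns 0 on success, -1 on a bad argument or a failed write,
 * and 1 if the library is not instrumented. */
int write_profile_to(const char *path) {
    if (path == NULL) return -1;
    if (!__llvm_profile_set_filename || !__llvm_profile_write_file) return 1;

    /* Without this call the runtime falls back to its default output name,
     * which may resolve to a location the app process cannot write to. */
    __llvm_profile_set_filename(path);

    return __llvm_profile_write_file() == 0 ? 0 : -1;
}
```

On a device, the path would typically be somewhere the app process can write, for example a file under the app's private files directory; the exact location is an assumption that depends on the app.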

Over17 commented 3 years ago

In the end, I was able to make a POC with PGO using Android NDK and instrumented builds. Thank you everyone!

pirama-arumuga-nainar commented 3 years ago

@Over17 Can you share if PGO was beneficial in this case?

Over17 commented 3 years ago

@pirama-arumuga-nainar sorry for the late answer, I have been away. Yes, I was able to get 4-7% better CPU performance (plus, looking at LLVM remarks and the code, there is potential for more vectorization in some places). For the users, the workflow overhead by having to have an instrumented build => run the benchmark (or even manual testing) => optimized build is quite a significant drawback. It may be easier for server workflows or even end-user apps, but for our product it's a bit more problematic.

pirama-arumuga-nainar commented 3 years ago

For the users, the workflow overhead by having to have an instrumented build => run the benchmark (or even manual testing) => optimized build is quite a significant drawback.

Agreed. It helps that Clang is tolerant of different/changing source code. For the Android platform, we don't create a profdata during the build. Instead, a job in our CI collects it ~once per day. Approximately once a week, we take this profdata, check it into source control, and use it for the optimized build.

Over17 commented 3 years ago

Yes, that was one of the concerns, but the paper on AutoFDO and your experience prove the opposite.

Another workflow issue for us is that the engine is shipped to the gamedevs precompiled and optimized, so it's impossible to apply compile-time optimizations at this stage. Some of the code is compiled on the gamedevs' machines, so PGO is doable there, but not for the core of the engine.