Error While Profiling C++ Code with Streamline on AWS Graviton 3

I'm following this tutorial to profile a simple C++ program that uses ARM intrinsics with Streamline.

Environment:

Hardware: AWS Graviton 3
CPU Counters: 2

Steps to Reproduce:

Install Streamline on AWS Graviton 3.
Compile and run the C++ code with ARM intrinsics.

Attempt to profile the code using the following command:

sl-record -C workflow_topdown_basic -o <output.apc> -A <your app command-line>

However, I get the following error:

Streamline Data Recorder v9.2.0 (Build ee3c2596c9f33b0d847028a8c8155e38d2c7a9a0 - Tag 0)
Copyright (c) 2010-2024 Arm Limited. All rights reserved.

Default perf mmap size set to 128 pages (512kb)
There are no mali devices to create readers
Detected 2 programmable event counters for Neoverse-V1 PMU
setpriority() failed
Gator ready
Counter 'ARMv8_Neoverse_V1_metric_backend_bound' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_backend_mem_bound' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_backend_stalled_cycles' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_branch_misprediction_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_branch_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_cpi' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_frontend_bound' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_frontend_stalled_cycles' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_integer_dp_percentage' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_ipc' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_itlb_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_itlb_walk_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l1d_cache_miss_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l1d_cache_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l1i_cache_miss_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l1i_cache_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l1i_tlb_miss_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l1i_tlb_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l2_cache_miss_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l2_cache_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l2d_cache_miss_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l2d_cache_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l3_cache_miss_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_l3_cache_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_ll_cache_read_hit_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_ll_cache_read_miss_ratio' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_ll_cache_read_mpki' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_load_percentage' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_retired_ops_percent' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_scalar_fp_percentage' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_simd_percentage' was not recognized
Counter 'ARMv8_Neoverse_V1_metric_store_percentage' was not recognized
Found metrics set 0xf1cbdecd26f0 for core type Neoverse-V1, n_counters=2 (used 0, raw 2, ret 0, avail 2)
Combinations set size 26
Multiplexed CPU counters currently only work in system-wide mode, or when inherit is no/poll/experimental
Per-function metrics are not supported in application tracing mode when `--inherit yes` (the default) is used.
perf setup failed, are you running Linux 3.4 or later?
Unable to communicate with the perf API, please ensure that CONFIG_TRACING and CONFIG_CONTEXT_SWITCH_TRACER are enabled. Please refer to streamline/gator/README.md for more information.

Please provide guidance on how to resolve this error or suggest any potential misconfigurations or steps that might have been overlooked.

Hi @rakshithgb-fujitsu - The "Multiplexed CPU counters currently only work in system-wide mode, or when inherit is no/poll/experimental. Per-function metrics are not supported in application tracing mode when --inherit yes (the default) is used." is the relevant part of that dump of text...

For best results (assuming you are using Amazon Linux 2023), follow the instructions to patch the kernel, then retry. Otherwise your options are:

Pass -I no to sl-record if your test application is single threaded
Run in system-wide mode (as root) by passing -S yes to sl-record. This is preferable if your test application has a large number of threads, and/or you are running on an instance with a large number (>8) of CPUs.
Pass -I poll to sl-record if your test application is multithreaded (or multiprocess). This can lead to an error where sl-record runs out of file descriptors. If you have root you can sometimes raise the limit with something like:

ulimit -H -n $((64*1024*1024))
ulimit -S -n $((64*1024*1024))

This approach is generally best used where there is a small number of threads and a small number of CPUs, since sl-record must use one file descriptor per thread, per CPU, per counter.
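To judge in advance whether -I poll is likely to hit the descriptor limit, the per-thread, per-CPU, per-counter rule can be turned into a rough estimate. This is only a sketch and not part of the Streamline tools: the function names and the example thread, CPU and counter counts are hypothetical, and the number of descriptors sl-record really opens may differ.

```python
import resource


def poll_mode_descriptors(threads: int, cpus: int, counters: int) -> int:
    """Rough estimate: one file descriptor per thread, per CPU, per counter."""
    return threads * cpus * counters


def fits_within_limit(needed: int, limit: int) -> bool:
    """True if the estimate fits under the limit (RLIM_INFINITY means unlimited)."""
    return limit == resource.RLIM_INFINITY or needed <= limit


# Hypothetical workload: 32 threads on a 64-CPU instance with 2 counters.
needed = poll_mode_descriptors(threads=32, cpus=64, counters=2)

# Soft limit on open files for the current process (what `ulimit -S -n` reports).
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

print(f"estimated descriptors: {needed}, soft limit: {soft_limit}")
if not fits_within_limit(needed, soft_limit):
    print("estimate exceeds the soft limit: raise it or consider -S yes")
```

If the estimate is far above the soft limit, system-wide mode is probably the safer choice.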

For example:

sl-record -I no -C workflow_topdown_basic -o <output.apc> -A <your app command-line>

One other thing to note: the log says there are only 2 PMU counters available, so I guess you are running on a hypervised instance rather than on a metal instance. This will work, but the kernel will have to multiplex the various groups of counters needed for collecting all the metrics. The kernel does this once every ~3ms, so for good coverage your workload needs to run for a fairly long time (e.g. multiple seconds).

I'd also add that you are using an older version of the tool (9.2.0). We've now released 9.2.2 which includes some bug fixes that are worth picking up:

https://artifacts.tools.arm.com/arm-performance-studio/2024.3/Arm_Streamline_CLI_Tools_9.2.2_linux_arm64.tgz

@bengaineyarm -I no worked for my single-threaded test! Thank you for this information. I have a couple of follow-up questions. And yes, I'm currently running the tests on a hypervised instance.

1) How much of an impact would the patching actually have on the perf analysis? I ask because the docs show all options as available in both cases, so what would be the difference?
2) How is Streamline different from the regular perf tool? Any tips on how to use Streamline are much appreciated (we work on tuning mathematical kernels such as matrix multiplication).
3) As you pointed out regarding the hypervised instance, is there a minimum time the program must run to capture enough data?

The option that is missing without the patches is -I experimental; the patch does two things:

The first half (the latest upstream version is https://lore.kernel.org/linux-perf-users/20240730084417.7693-1-ben.gainey@arm.com/ ) allows us to support collecting groups of performance counters in a multithreaded application without having to use the poll method. This means we can attach one set of counters to the first thread and the kernel will automatically apply them to all its children, so sl-record only needs 1*per-cpu*per-counter file descriptors instead of per-thread*per-cpu*per-counter. This reduces the risk of the "run out of file descriptors" problem, and also saves sl-record from having to manually poll for new threads (which is racy). Without this change you get the limitations mentioned in my original reply around profiling multithreaded/multiprocess applications.
The second half of the patch ( https://lore.kernel.org/linux-perf-users/20240422104929.264241-1-ben.gainey@arm.com/ ) allows sl-record to alternate its sample rate between a long and short period; this allows us to collect performance counters over a small window of time in a much more efficient way than otherwise.

Without the patch, sl-record has to sample at a very high frequency to get the same effect. This leads to much larger sampling overhead, which can have a worse impact on the collected data (more risk of perturbing the cache, pipeline, branch predictor etc.), and to potentially much larger captures, which are slower for sl-analyze to process.

The Streamline CLI tools are not intended to be a replacement for perf record et al. They focus on a specific set of workflows around tuning for Arm platforms. They form part of the Streamline tool which ships with Arm Performance Studio. Support for top-down function metrics is new, so one of the reasons for releasing these command-line tools separately is to enable early, fast feedback on this feature whilst we work on integrating them into the GUI tools, refining them and so on. In that regard, we'd greatly appreciate any feedback, not just on tool issues/bugs/usability but on the top-down metrics themselves: anything that is unclear, cases where the metrics produce unintuitive results, data that you wished was available but appears to be missing, etc. For tuning kernels, the top-down metrics approach should be well suited.

On the minimum run time, there are really two things to consider:

The first is that some of the top-down metrics are derived from more than two raw PMU counters; on a hypervised instance that exposes only two counters these metrics will simply not work (if you select them they will not be collected).
The second is that sl-record will try to fit all the raw PMU counters it needs for each of the metrics that you select into as few counter groups as possible, but how it can pack them together is a function of the number of PMU counter slots available from the OS (in your case 2). When there are multiple groups of counters the kernel will multiplex them, meaning that each group is only active for ~3ms and is then disabled whilst the other groups are each given a turn. When there are fewer PMU counter slots, sl-record cannot pack them as tightly as it can with e.g. 6 PMU counter slots, so you end up with more groups of counters in total, each of which is multiplexed every 3ms or so. If there are 10 groups of counters, each metric will only see 1/10th of the time that the application is running. If your application runs for a short amount of time, you might not get all of the counter groups by the time the application ends (or they may only have been scheduled a few times, so you don't get a very representative sample).
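The effect of multiplexing on short runs can be put into numbers. The sketch below assumes an idealised round-robin in which every group gets an equal ~3ms turn; the function name and the example run times are hypothetical, and real kernel scheduling will be less regular than this.

```python
def multiplex_coverage(runtime_s: float, groups: int, slice_s: float = 0.003):
    """Return (fraction of the run each group is active, approximate turns per group).

    Assumes strict round-robin with equal slices, which is a simplification.
    """
    fraction = 1.0 / groups
    turns = runtime_s / (slice_s * groups)
    return fraction, turns


# Hypothetical example: 10 counter groups, as in the case described above.
for runtime_s in (0.05, 1.0, 5.0):
    fraction, turns = multiplex_coverage(runtime_s, groups=10)
    print(f"{runtime_s:5.2f}s run: each group active {fraction:.0%} of the time, "
          f"about {turns:.0f} turns")
```

Under these assumptions a 50ms run gives each group only one or two turns, while a run of several seconds gives each group more than a hundred.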

On a hypervised instance it's probably worth running your application for at least a few seconds, and/or being selective about which metrics you enable (starting with -C workflow_topdown_basic is good, but you can use sl-record --print counters|grep -i metric to see individual metrics). You will see in the results if the metrics look patchy (e.g. reading as zeros unexpectedly).

You might also consider the -r high option, which increases the sample rate from ~1kHz to ~10kHz, meaning that you should get better coverage for shorter workloads (NB: this is the rate at which samples are taken, not the 3ms rate at which multiplexing occurs, which is not controllable by sl-record).
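As a rough illustration of why the higher sample rate helps, the arithmetic below estimates how many samples land inside one ~3ms multiplexing turn at each rate. It assumes perfectly steady rates and uses the approximate figures quoted above; the function name is hypothetical.

```python
def samples_per_slice(sample_rate_hz: float, slice_s: float = 0.003) -> float:
    """Approximate number of samples taken while one counter group is active."""
    return sample_rate_hz * slice_s


# Approximate rates quoted above: default ~1kHz, -r high ~10kHz.
for label, rate_hz in (("default", 1_000), ("-r high", 10_000)):
    print(f"{label}: about {samples_per_slice(rate_hz):.0f} samples per 3ms turn")
```

With only a handful of samples per turn at the default rate, a short workload leaves each counter group with very few data points, which is where the higher rate should help.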

@rakshithgb-fujitsu How did you get on? If you have any feedback on the tools or the data they produce we'd love to hear it.

@solidpixel Apologies for not getting back on this; we haven't really had the chance to spend much time on this tool. From the time we've spent on it so far, we do think a richer visualization would definitely help (example - https://github.com/jrfonseca/gprof2dot). We will try to evaluate this tool in the coming months and keep you guys posted.

