Closed: natewise closed this issue 1 year ago
The typical way to do this kind of microbenchmarking is to make repeated calls and average the results. Beyond the stat calls themselves, you're also dealing with cache misses and branch predictors that need time to train. For instance, the microbenchmarks we use for energy measurements are here: https://github.com/bespoke-silicon-group/bsg_manycore/blob/master/software/spmd/energy_ubenchmark/add.S
Hello!
So I'm trying to run a very simple program to see how many cycles integer and floating-point add, multiply, and divide each take, but I'm running into some unexpected behavior. I included as comments how many cycles each operation takes, based on values in vanilla_stats.csv.
For reference, I attached the vanilla_stats.csv for this program here: v1_stats.csv
I probed further by running this example:
In this second example, I would expect an integer add to take only one to a few cycles. The fact that it takes 20 cycles for 20 operations suggests that the profiler call itself is being included in the stats (I checked v1_stats.csv and saw remote_ld and remote_st instructions I never directly invoked, so I'm fairly sure these calls are being counted). If that's the case, is there a way to get the cycle count of only what's between the cuda_stat_print calls?

It's also unusual to me that the three float divisions don't take longer than a single float division (or than the rest of the operations). Maybe gcc is optimizing things away and I'm just unaware.

I was hoping the behavior of these cuda_stat_print functions could be explained and, if possible, the correct approach for this measurement shown. Thank you very much!