Closed: natewise closed this issue 1 year ago
The typical way to do this kind of microbenchmarking is to make repeated calls and average the results. Beyond the stat calls themselves, you're also dealing with cache misses and branch predictors that need time to train. For instance, the microbenchmarks we use for energy measurements are here: https://github.com/bespoke-silicon-group/bsg_manycore/blob/master/software/spmd/energy_ubenchmark/add.S
Hello!
So I'm trying to run a very simple program to see how many cycles integer and floating-point add, multiply, and divide each take, but I'm running into some unexpected behavior. I included as comments how many cycles each operation takes, based on values in vanilla_stats.csv.
For reference, I attached the vanilla_stats.csv for this program here: v1_stats.csv
I probed further by running this example:
In this second example, I would expect an integer add to take only one to a few cycles. The fact that it takes 20 cycles for 20 operations suggests that the profiler call itself is being included in the stats (I checked v1_stats.csv and saw remote_ld and remote_st instructions I never directly invoked, so I'm fairly sure these calls are being counted). If that's the case, is there a way to get the cycle count of only what's between the cuda_stat_print calls?

It's also unusual to me that the three float divisions don't take longer than a single float division (or than the rest of the operations). Maybe gcc is optimizing things away and I'm just unaware.

I was hoping the behavior of these cuda_stat_print functions could be explained and, if possible, the correct approach for this measurement shown. Thank you very much!