bespoke-silicon-group / bsg_replicant

BSG Replicant: Cosimulation and Emulation Infrastructure for HammerBlade

CUDA Stat Printer Unexpected Behavior #815

Closed by natewise 1 year ago

natewise commented 1 year ago

Hello!

I'm trying to run a very simple program to see how many cycles integer and floating-point add, multiply, and divide each take, but I'm running into some unexpected behavior. I included, as comments, the cycle count for each operation, taken from the values in vanilla_stats.csv.

[image: screenshot of the program, with per-operation cycle counts in comments]

For reference, in case it's helpful, I attached the vanilla_stats.csv for this program here: v1_stats.csv

I probed further by running this example:

[image: screenshot of the second example]

In this second example, I would expect an integer add to take only one to a few instructions. The fact that it takes 20 cycles with 20 operations makes me think the profiler call itself is being included in the stats: I checked v1_stats.csv and found remote_ld and remote_st instructions that I never directly invoked. If that's the case, is there a way to get the cycle count of only what's between the cuda_stat_print calls?

It's also unusual to me that the three float divisions don't take longer than a single float division (or the rest of the operations). Maybe gcc is optimizing things in a way I'm not aware of. I was hoping someone could explain the behavior of these cuda_stat_print functions and, if possible, suggest the correct approach for this program. Thank you very much!

dpetrisko commented 1 year ago

The typical way to do this kind of microbenchmarking is to make repeated calls and average the results. Beyond the stats calls, you're also dealing with cache misses and branches, which need time to train. For instance, the microbenchmarks for energy are here: https://github.com/bespoke-silicon-group/bsg_manycore/blob/master/software/spmd/energy_ubenchmark/add.S