PTX emulation, div power seems too high

Hi,

The NVIDIA Volta architecture does not have hardware divider units. Instead, divisions are implemented as a sequence of other instructions (e.g., FMAs, shifts). However, the PTX ISA does include division instructions, so Accel-Sim simulates them as such in execution-driven (PTX) mode. Hence, in the PTX model of AccelWattch, we model and report INT32, FP32, and FP64 divisions separately to capture the power consumption of their implementation.

The scaled per-access energies of these three instructions are higher (by an order of magnitude or more) than those of other arithmetic instructions in the AccelWattch PTX model for Volta. You can calculate each number by multiplying the per-access energy given by McPAT for the instruction by its scaling factor. For instance, for FP_DIV it is 3.4293E-11 J * 4.5999 = 157.74 pJ, while the same figure for an FP_MUL is 7.289 pJ. Note that DP_DIV has a significantly longer initiation_interval and latency than INT_DIV and FP_DIV. Hence, the energy spent on FP64 divisions is spread across a longer time period and shows up as a lower relative power consumption, even though the per-access energy of an FP64 division is higher than that of an FP32 division in the AccelWattch PTX model.
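To make the arithmetic above easy to repeat for other instructions, here is a minimal Python sketch. The three numbers are the ones quoted in this reply; the function and variable names are my own and do not come from AccelWattch or McPAT.

```python
# Values quoted in this thread; all names below are illustrative only.
MCPAT_FP_DIV_ENERGY_J = 3.4293e-11   # McPAT per-access energy for FP_DIV, in joules
FP_DIV_SCALING_FACTOR = 4.5999       # scaling factor applied to FP_DIV
FP_MUL_SCALED_ENERGY_PJ = 7.289      # already-scaled FP_MUL energy, in picojoules


def scaled_energy_pj(mcpat_energy_j, scaling_factor):
    """Return the scaled per-access energy in picojoules."""
    return mcpat_energy_j * scaling_factor * 1e12


fp_div_pj = scaled_energy_pj(MCPAT_FP_DIV_ENERGY_J, FP_DIV_SCALING_FACTOR)
ratio = fp_div_pj / FP_MUL_SCALED_ENERGY_PJ

print(f"FP_DIV scaled energy: {fp_div_pj:.2f} pJ")
print(f"FP_DIV / FP_MUL:      {ratio:.1f}x")
```

With these inputs, an FP_DIV costs roughly 22 times the energy of an FP_MUL per access.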

In the PTX config file that you pointed to, the latencies (ptx_opcode_latency_int/fp/dp) of INT32, FP32, and FP64 divisions are 21, 39, and 330 cycles, respectively. Note that the initiation_interval in Accel-Sim models only the throughput of these instructions, so you have to consider their latency as well.
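As a rough illustration of why a longer latency lowers the apparent power, the sketch below spreads one division's energy evenly over its latency. This is a deliberate simplification (a single instruction in flight, energy distributed uniformly) and not how AccelWattch itself accounts for power; the names are hypothetical. Since the DP_DIV energy is not quoted in this thread, the code only computes how much larger it could be before its energy per cycle exceeded that of FP_DIV.

```python
# Latencies from the PTX config discussed above, in cycles.
FP_DIV_LATENCY_CYCLES = 39
DP_DIV_LATENCY_CYCLES = 330
# Scaled FP_DIV per-access energy from this thread, in picojoules.
FP_DIV_ENERGY_PJ = 157.74


def energy_per_cycle_pj(energy_pj, latency_cycles):
    """Simplified model: one instruction's energy spread evenly over its latency."""
    return energy_pj / latency_cycles


fp_div_rate = energy_per_cycle_pj(FP_DIV_ENERGY_PJ, FP_DIV_LATENCY_CYCLES)

# DP_DIV shows a lower energy per cycle than FP_DIV as long as its
# per-access energy is below this multiple of the FP_DIV energy.
break_even_ratio = DP_DIV_LATENCY_CYCLES / FP_DIV_LATENCY_CYCLES

print(f"FP_DIV: {fp_div_rate:.2f} pJ per cycle of latency")
print(f"DP_DIV stays below that up to {break_even_ratio:.2f}x the FP_DIV energy")
```

Under this simplified model, an FP64 division can cost several times the energy of an FP32 division and still register as lower power.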

Hope this helps! Vijay

accel-sim / accel-sim-framework, issue #121