AKKamath closed this issue 2 years ago
Hi,
The NVIDIA Volta architecture does not have hardware divider units. Instead, a division is implemented as a sequence of other instructions (e.g., FMAs, shifts). However, the PTX ISA does include division instructions, so Accel-Sim simulates them as single instructions in event-driven mode. Hence, in the PTX model of AccelWattch, we model and report INT32, FP32, and FP64 divisions separately to capture the power consumption of their implementation.
The scaled per-access energy of these three instructions is much higher (by an order of magnitude or more) than that of other arithmetic instructions in the AccelWattch PTX model for Volta. You can calculate this number by multiplying the per-access energy given by McPAT for each instruction by its scaling factor. For instance, for FP_DIV it is 3.4293E-11 J * 4.5999 ≈ 157.74 pJ, while the same quantity for FP_MUL is 7.289 pJ. Note that DP_DIV has a significantly longer initiation_interval and latency than INT_DIV and FP_DIV. Hence, the energy spent on FP64 divisions is spread across a longer time period and shows up as a lower relative power consumption, even though the per-access energy of an FP64 division is higher than that of an FP32 division in the AccelWattch PTX model.
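To make the arithmetic above concrete, here is a minimal Python sketch of the calculation. The function name `scaled_energy_pj` is made up for illustration and is not part of AccelWattch or McPAT; the only inputs are the FP_DIV and FP_MUL numbers quoted above.

```python
# Illustrative only: reproduces the per-access energy arithmetic from this reply.
# `scaled_energy_pj` is a hypothetical helper, not an AccelWattch or McPAT API.

def scaled_energy_pj(mcpat_energy_joules: float, scaling_factor: float) -> float:
    """Return the scaled per-access energy in picojoules."""
    return mcpat_energy_joules * scaling_factor * 1e12  # 1 J = 1e12 pJ

# FP_DIV: McPAT per-access energy (J) and scaling factor quoted above.
fp_div_pj = scaled_energy_pj(3.4293e-11, 4.5999)

# FP_MUL: already-scaled per-access energy quoted above.
fp_mul_pj = 7.289

ratio = fp_div_pj / fp_mul_pj

print(f"FP_DIV scaled per-access energy: {fp_div_pj:.2f} pJ")
print(f"FP_DIV / FP_MUL energy ratio:    {ratio:.1f}x")
```

With these inputs, one FP_DIV costs roughly twenty times the energy of one FP_MUL.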
In the PTX config file that you pointed to, the latencies (ptx_opcode_latency_int/fp/dp) for INT32, FP32, and FP64 divisions are 21, 39, and 330 cycles respectively. Note that the initiation_interval in Accel-Sim models only the throughput of these instructions, so you have to consider their latency as well.
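As a rough illustration of why latency matters in addition to the initiation interval, the sketch below spreads the FP_DIV per-access energy over two different cycle counts. This is a deliberately simplified back-of-the-envelope model, not how AccelWattch computes power: it assumes a single stream of FP_DIV instructions that is either purely throughput-bound (a new division every initiation_interval cycles) or purely latency-bound (each division waits for the previous one). The helper name `energy_per_cycle_pj` is hypothetical; the values 157.74 pJ, 4, and 39 come from this thread.

```python
# Simplified sketch, NOT the AccelWattch power model.
# Assumption: energy per instruction is spread evenly over the cycles that
# separate consecutive instructions of the same type.

def energy_per_cycle_pj(energy_pj: float, cycles_per_instruction: int) -> float:
    """Average energy drawn per cycle when one instruction completes every N cycles."""
    if cycles_per_instruction <= 0:
        raise ValueError("cycles_per_instruction must be positive")
    return energy_pj / cycles_per_instruction

FP_DIV_ENERGY_PJ = 157.74        # scaled per-access energy from this thread
FP_DIV_INITIATION_INTERVAL = 4   # from the PTX config discussed in this thread
FP_DIV_LATENCY = 39              # from the PTX config discussed in this thread

# Independent divisions issued back to back: limited by the initiation interval.
throughput_bound = energy_per_cycle_pj(FP_DIV_ENERGY_PJ, FP_DIV_INITIATION_INTERVAL)

# A chain of dependent divisions: limited by the full latency.
latency_bound = energy_per_cycle_pj(FP_DIV_ENERGY_PJ, FP_DIV_LATENCY)

print(f"throughput-bound: {throughput_bound:.2f} pJ/cycle")
print(f"latency-bound:    {latency_bound:.2f} pJ/cycle")
```

Under these assumptions the two cases differ by roughly a factor of ten, so how a microbenchmark issues its divisions strongly affects the average power it reports.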
Hope this helps! Vijay
Hi. I used PTX emulation mode in GPGPU-Sim together with AccelWattch to capture the power consumption of different instructions on the provided V100 config. As a reference, I ran the functional microbenchmarks used in the original paper.
I plotted the average power consumption obtained, shown in the graph below.
It seems as though the div operations (FP_DIV, INT_DIV) are drawing a lot of power, and I suspect the reported values may be incorrect.
I noticed that the initiation interval in the PTX config is 4 for FP_DIV, compared to 2 for other FP operations. This seems a bit too low for a division, and I wanted to double-check whether it is correct.
Thanks