One difference between the distributions and sizes being tested is the significant drop in Microarchitecture Usage for both the discrete normal and uniform 100k cases. Both cases indicate that the function Xi_EH3::element is poorly utilized, although looking at the code for that function suggests the issue might lie in the calls it makes to other functions.
The only thing called there, other than the assignment of two variables, is the function EH3, so any inherent problem should be part of that call. The only difference between these two distributions and the other six is that both of them produce values that tend not to repeat often, especially at lower sample sizes. This effect could cause issues for Fast-AGMS but generally shouldn't impact AGMS, given the way it processes new data. Expanding the Microarchitecture column reveals a fairly low CPI Rate, hinting that the code is not memory bound. VTune suggests running a Microarchitecture Exploration Analysis in order to identify the issue. Another explanation for this deviation could simply be a metric-related error, since VTune may have needed more data to report accurate results.
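For context, here is a simplified sketch of the general shape of an EH3-style ±1 generator. This is an assumption about the scheme, not this repository's Xi_EH3::element or EH3, which may differ in details; the point is that the whole computation is bit arithmetic on the seed and the key in registers, with no memory traffic, which fits the low CPI Rate pointing away from a memory-bound explanation.

```cpp
#include <bit>
#include <cstdint>

// Simplified EH3-style ±1 generator (illustrative assumption, not the repo's
// code): a seeded linear parity term combined with a non-linear term over
// adjacent key-bit pairs, all computed in registers.
int eh3_sign(uint64_t s0, uint64_t seed, uint32_t key) {
    uint64_t linear = std::popcount(seed & key) & 1ULL;       // parity of seed AND key
    uint64_t nonlinear = 0;
    for (int b = 0; b < 32; b += 2)                           // XOR of bit-pair ANDs
        nonlinear ^= ((key >> b) & 1ULL) & ((key >> (b + 1)) & 1ULL);
    return ((s0 ^ linear ^ nonlinear) & 1ULL) ? -1 : +1;      // map the result bit to ±1
}
```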
Comparing microarchitecture usage between the two algorithms, Fast-AGMS tends not to utilize the CPU as fully as AGMS does. This can be observed in the heatmap below [REF heatmap], where values below 70% are considered problematic according to VTune.
For AGMS, Update_Sketch, seq_xor and EH3 account for 95.3% of total usage and all hover at or above 90% utilization across the specific scenarios. In the case of Fast-AGMS, H3 and Update_Sketch account for 82.5% and range from the mid-70s to the mid-80s. This shows that microarchitecture analysis will be more beneficial for the latter, as there is a bigger chance of finding a bottleneck other than simply being CPU-bound.
The total per-algorithm microarchitecture utilization, as reported by VTune, is displayed in the graph below [REF]. Notice the decrease in utilization per distribution for Fast-AGMS. This further strengthens the assumption that the type of data does affect how this algorithm operates.
One assumption was that the distribution of the data would impact the throughput of the algorithm, as discussed previously. That didn't hold for AGMS: the time to complete the 100-million-input run was within the margin of error for all three distributions, resulting in the same throughput up to the second decimal point. For Fast-AGMS, however, the benchmarks did show an increase in compute time going from Zipfian to either normal or uniform data, with uniform being the slowest, confirming the assumption for this test at least.
It is also no surprise that AGMS was not affected by the change in data frequency, since the vector being updated is the same for each input and is fetched from cache sequentially.
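To make that access pattern concrete, here is a minimal sketch of an AGMS-style update loop (hypothetical names and hash, not the code from this repository): every counter in the sketch vector is touched for every input, so memory is walked sequentially no matter what the key is.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a seeded ±1 hash (the real sketch uses an
// EH3-based xi); only the register-level mixing differs.
int xi_sign(uint64_t seed, uint64_t key) {
    uint64_t x = seed ^ key;
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;  // bit mixing
    return (x & 1) ? +1 : -1;
}

// AGMS-style update: every counter is updated for every input, so the
// sketch vector is scanned sequentially and stays cache- and prefetch-friendly.
void agms_update(std::vector<int64_t>& sketch,
                 const std::vector<uint64_t>& seeds,
                 uint64_t key) {
    for (size_t i = 0; i < sketch.size(); ++i)
        sketch[i] += xi_sign(seeds[i], key);
}
```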
Fast-AGMS, on the other hand, hashes the input and touches an effectively random cell in each row of its update matrix, making cache misses more likely and stressing the system more as the variance of the resulting hash values grows. This can also be partially observed in the "other" section of the hotspots graph [REF], where the usage increases with the change of data in a similar fashion. The result was a throughput difference of close to 4% between Zipfian and uniform data.

Comparing Fast-AGMS across bucket sizes, there is no meaningful difference in resulting throughput between bucket counts that are a power of 2 and those that are not (see the sketch of the index computation below). Note that this result could depend on the hardware, with different architectures having different instruction bottlenecks.
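For contrast, here is a minimal sketch of a Fast-AGMS-style update, again with hypothetical names and hashes rather than this repository's H3 or Update_Sketch: each row maps the key to a single bucket, so consecutive inputs with widely varying hash values scatter writes across the matrix, which is where the extra cache pressure comes from. The index computation is also where a power-of-two bucket count would allow a bit mask instead of a modulo, the difference that did not turn out to matter in these runs.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bucket hash standing in for an H3-style row hash.
uint64_t bucket_hash(uint64_t seed, uint64_t key) {
    uint64_t x = seed ^ (key * 0x9e3779b97f4a7c15ULL);
    x ^= x >> 32;
    return x;
}

// Hypothetical ±1 hash derived from the same mixer, used for the counter sign.
int xi_sign(uint64_t seed, uint64_t key) {
    return (bucket_hash(~seed, key) & 1) ? +1 : -1;
}

// Fast-AGMS-style update: only one bucket per row is touched, chosen by the
// hash, so writes jump around the matrix instead of scanning it.
void fast_agms_update(std::vector<std::vector<int64_t>>& rows,
                      const std::vector<uint64_t>& seeds,
                      uint64_t key) {
    const size_t buckets = rows[0].size();
    const bool pow2 = (buckets & (buckets - 1)) == 0;
    for (size_t r = 0; r < rows.size(); ++r) {
        uint64_t h = bucket_hash(seeds[r], key);
        // With a power-of-two bucket count the modulo reduces to a mask.
        size_t idx = pow2 ? (h & (buckets - 1)) : (h % buckets);
        rows[r][idx] += xi_sign(seeds[r], key);
    }
}
```

On recent cores the integer modulo is cheap relative to the surrounding hash work, which would be consistent with the mask making no measurable difference in these benchmarks.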