I have done a bunch of testing of the QPI events on Xeon E5 v3 and have concluded that many of the QPI events are badly broken. Using the STREAM benchmark, with and without streaming stores, with local data and remote data, the data shows that two of the events used by tacc_stats on Haswell appear to be correct, one is definitely broken, and the last event is one I don't currently know how to test.
The TxL_FLITS_G1.SNOOP (Event 0x00+bit21, Umask 0x01) event used by tacc_stats appears to be counting correctly in Home Snoop mode. Note that this will only increment for local accesses in Home Snoop mode, since remote accesses are handled by Read Request (counted by the HOM_REQ event). TACC's Hikari and Lonestar5 systems currently operate in Home Snoop mode, but TACC's Wrangler system was booted in Early Snoop mode the last time I checked, and I have not checked the accuracy of the counters there.
The TxL_FLITS_G1.HOM_NONREQ (Event 0x00+bit21, Umask 0x04) event appears to be counting correctly in Home Snoop mode. Again, the number of Snoop requests and responses will be different in Early Snoop mode and I have not checked the accuracy of the counters in that mode.
The RxL_FLITS_G1.DRS_DATA (Event 0x02+bit21, Umask 0x08) event appears to be undercounting by 50% to 60% for ordinary reads, RFOs, and streaming stores. Both QPI channels undercount, but read and RFO counts are ~12-17% lower on Channel 1 than on Channel 0. For historical data it is probably OK to just double the recorded values, but for the future I recommend switching to an equivalent event that is more accurate: TxL_FLITS_G1.DRS_DATA (Event 0x00+bit21, Umask 0x08).
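A minimal sketch of the historical-data correction described above: double the recorded RxL_FLITS_G1.DRS_DATA counts to approximate the true flit count, and convert flits to bytes (each QPI data flit carries an 8-byte payload, so a 64-byte cache line moves as 8 data flits). The function names here are illustrative, not from tacc_stats.

```python
FLIT_PAYLOAD_BYTES = 8      # each QPI data flit carries 8 bytes of payload
FLITS_PER_CACHE_LINE = 8    # a 64-byte cache line moves as 8 data flits

def correct_drs_flits(raw_rxl_drs_data: int) -> int:
    """Approximate the true DRS data flit count from a recorded
    RxL_FLITS_G1.DRS_DATA sample by doubling it (the event appears
    to undercount by roughly 2x)."""
    return 2 * raw_rxl_drs_data

def flits_to_bytes(flits: int) -> int:
    """Convert a DRS data flit count to bytes of cache-line payload."""
    return flits * FLIT_PAYLOAD_BYTES

if __name__ == "__main__":
    raw = 1_000_000
    corrected = correct_drs_flits(raw)
    print(corrected, flits_to_bytes(corrected))
```

This is only a coarse correction: the observed undercount ranged from 50% to 60% and differed between the two QPI channels, so doubling can still be off by ~10-20%.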
I don't currently have a methodology to generate a known traffic pattern to test RxL_FLITS_G2.NCB_DATA, but given the trouble with RxL_FLITS_G1 events, I would be suspicious. If we can find a workload that generates non-trivial event counts here, we should at least compare the RxL_FLITS_G2.NCB_DATA to the TxL_FLITS_G2.NCB_DATA to see if the transmit side is counting at twice the rate for NCB data like it is for DRS data.
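The Rx-versus-Tx comparison suggested above can be sketched as a simple ratio check: if the transmit-side NCB count on one socket is roughly twice the receive-side NCB count on the other, that would match the 2x pattern seen for DRS data. The helper names and the tolerance are my own choices, not an established methodology.

```python
def tx_rx_ratio(tx_flits: int, rx_flits: int) -> float:
    """Ratio of transmit-side to receive-side flit counts for the
    same traffic (e.g. TxL_FLITS_G2.NCB_DATA on one socket vs.
    RxL_FLITS_G2.NCB_DATA on the other)."""
    if rx_flits == 0:
        return float("nan")
    return tx_flits / rx_flits

def rx_looks_undercounted(tx_flits: int, rx_flits: int, tol: float = 0.25) -> bool:
    """True if the Tx side counts roughly 2x the Rx side, i.e. the
    same undercount pattern observed for DRS data flits."""
    ratio = tx_rx_ratio(tx_flits, rx_flits)
    return abs(ratio - 2.0) <= 2.0 * tol
```

A ratio near 1.0 would instead suggest the Rx event is counting correctly for NCB data even though it is broken for DRS data.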
Since these are two-socket systems, the TxL_FLITS_G1.SNOOP counts on one socket should always match the TxL_FLITS_G1.HOM_NONREQ counts of the other socket, and I have observed essentially perfect matches in all of my tests. Since one of these can be considered redundant, I recommend that we replace TxL_FLITS_G1.HOM_NONREQ (snoop responses) with TxL_FLITS_G1.HOM_REQ (read requests). This will provide some information about remote memory accesses. My tests indicate that TxL_FLITS_G1.HOM_REQ increments once for cross-socket Read, RFO, or Writeback transactions, and twice for cross-socket Streaming Store transactions. Although this is a bit weird, comparing TxL_FLITS_G1.HOM_REQ with IMC CAS read+write transactions should provide a useful measure of transaction locality. I have not figured out if there is a clever way to combine these transaction requests with the corresponding data traffic to figure out the transaction types -- at first glance there are more unknowns than known values, but I have not spent a lot of time looking at this....
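One rough way to turn the HOM_REQ/CAS comparison above into a locality metric, sketched under the assumptions stated in this post: in a two-socket system, the TxL_FLITS_G1.HOM_REQ count on the peer socket counts requests aimed at this socket's home agent, so dividing it by this socket's IMC CAS read+write total estimates the fraction of local DRAM traffic that originated remotely. The function below is hypothetical and ignores the 2x counting of streaming stores, which will inflate the estimate when streaming stores dominate the remote traffic.

```python
def remote_fraction(peer_tx_hom_req: int,
                    imc_cas_reads: int,
                    imc_cas_writes: int) -> float:
    """Rough fraction of this socket's DRAM transactions that were
    requested by the remote socket.

    peer_tx_hom_req: TxL_FLITS_G1.HOM_REQ counted on the *other* socket
                     (one per remote Read/RFO/Writeback, two per remote
                     streaming store, so this is an overestimate when
                     streaming stores dominate).
    imc_cas_reads/writes: CAS counts summed over this socket's IMC channels.
    """
    total_cas = imc_cas_reads + imc_cas_writes
    if total_cas == 0:
        return 0.0
    # Clamp in case the streaming-store double-count pushes the ratio past 1.
    return min(peer_tx_hom_req / total_cas, 1.0)
```

Disentangling the actual transaction mix (reads vs. RFOs vs. writebacks vs. streaming stores) from these counts is the open problem noted above; this only gives an aggregate locality estimate.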