Guidelines for effective usage of PMem
Design algorithms to fit data into PMem blocks (256 bytes) rather than single cache lines (64 bytes).
Use streaming (non-temporal) stores, or regular stores followed by clwb, especially for data that is repeatedly written to the same cache line (e.g. array-like structures with a size field, or a global counter used for time-stamping).
Using too many threads can reduce performance, so tune the thread count rather than maximizing it.
PMem read and write bandwidth is lower than that of DRAM. Prefer DRAM for performance-critical code.
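The first two guidelines can be illustrated with a minimal C sketch. It is an illustration under stated assumptions, not a reference implementation: `pmem_log`, `persist_range`, and `log_append` are hypothetical names, the structure is assumed to live in a PMem-backed mapping, and the `_mm_clwb` path is compiled only when the compiler is invoked with CLWB support (e.g. `-mclwb`). Without that flag the flush is a no-op and nothing is actually persisted.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__CLWB__)
#include <immintrin.h>
#endif

#define CACHE_LINE_SIZE 64u
#define PMEM_BLOCK_SIZE 256u /* block size named in the guideline */
#define LOG_CAPACITY ((PMEM_BLOCK_SIZE - sizeof(uint64_t)) / sizeof(uint64_t))

/* Hypothetical array-like structure with a size field, laid out so that
 * header and items together fill exactly one 256-byte PMem block. */
struct pmem_log {
    alignas(PMEM_BLOCK_SIZE) uint64_t size;
    uint64_t items[LOG_CAPACITY];
};

static_assert(sizeof(struct pmem_log) == PMEM_BLOCK_SIZE,
              "log must fill exactly one PMem block");

/* Write back every cache line overlapping [addr, addr + len) and fence.
 * Returns the number of cache lines covered. The clwb and sfence are
 * emitted only when compiled with CLWB support; otherwise this function
 * only counts lines (assumption: acceptable for a demonstration). */
static size_t persist_range(const void *addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;
    for (uintptr_t p = start; p < end; p += CACHE_LINE_SIZE) {
#if defined(__CLWB__)
        _mm_clwb((void *)p);
#endif
        lines++;
    }
#if defined(__CLWB__)
    _mm_sfence(); /* order the write-backs before later stores */
#endif
    return lines;
}

/* Append one value: store and write back the item first, then the size
 * field, so the size never covers an item that was not written back.
 * Returns 0 on success, -1 if the log is full. */
static int log_append(struct pmem_log *log, uint64_t value)
{
    if (log->size >= LOG_CAPACITY)
        return -1;
    uint64_t *slot = &log->items[log->size];
    *slot = value;
    persist_range(slot, sizeof *slot);
    log->size += 1;
    persist_range(&log->size, sizeof log->size);
    return 0;
}
```

Every append rewrites the `size` field, so the same cache line is written again and again. That is the access pattern the second guideline singles out. Streaming stores are the alternative the guideline names, but they are not shown here.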