drcachesim: optimize cache simulator

DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform


drcachesim: optimize cache simulator #1738

Open zhaoqin opened 9 years ago

zhaoqin commented 9 years ago

Currently, the cache simulator runs at ~500x the time of native execution. The overhead includes profiling overhead and communication overhead, but the cache simulator's own overhead dominates the overall slowdown.

One simple optimization is to parallelize the cache simulator by splitting the memory into sub-regions and running a cache simulator for each sub-region.

zhaoqin commented 9 years ago

Xref original issue #1703

peterpengwei commented 9 years ago

Does multithreading sound like a good solution to alleviate the issue? My initial thought is to assign each cache its own pthread. The LLC thread would hold a pthread mutex that all the I&D caches use to arbitrate access to it. If that sounds good, I will start implementing it to see if it helps.

zhaoqin commented 9 years ago

No, you should not parallelize the cache simulator by assigning each cache to an independent pthread. The communication overhead would be significant and would dominate the slowdown.

zhaoqin commented 9 years ago

The right way is to split the cache into subregions, with each subregion simulated by one thread. For example, you can use 4 threads to simulate memory reference addresses in [4N, 4N+cacheline), [4N+cacheline, 4N+2xcacheline), [4N+2xcacheline, 4N+3xcacheline), and [4N+3xcacheline, 4N+4xcacheline). That way there is no communication among the four threads, and we should get the maximum parallelization. The potential downside is that the memory references might concentrate on one or two of the subregions, leaving the other threads underused.
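To make the interleaving concrete, here is a minimal sketch of one reading of the scheme above: the cache-line index of each address, modulo the thread count, selects the simulator thread. The names `shard_for_address` and `partition_trace` are hypothetical and are not part of drcachesim. The sketch only shows the address-to-thread mapping, not the threads themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a drcachesim API): picks the simulator thread
// ("shard") for an address. All addresses within one cache line map to the
// same shard, and consecutive lines rotate through the shards.
inline std::size_t
shard_for_address(std::uint64_t addr, std::size_t line_size, std::size_t num_shards)
{
    assert(line_size > 0 && num_shards > 0);
    return static_cast<std::size_t>((addr / line_size) % num_shards);
}

// Hypothetical helper: splits a trace of addresses into one queue per shard,
// preserving the original order within each shard. Each queue could then be
// fed to an independent simulator thread with no cross-thread communication.
inline std::vector<std::vector<std::uint64_t>>
partition_trace(const std::vector<std::uint64_t> &trace, std::size_t line_size,
                std::size_t num_shards)
{
    std::vector<std::vector<std::uint64_t>> shards(num_shards);
    for (std::uint64_t addr : trace)
        shards[shard_for_address(addr, line_size, num_shards)].push_back(addr);
    return shards;
}
```

With 64-byte lines and 4 shards, addresses 0 through 63 go to shard 0, 64 through 127 to shard 1, and address 256 wraps back to shard 0. If the shard count divides the number of cache sets, every set is owned by exactly one shard, so each thread's replacement decisions are unaffected by the others. A trace that keeps hitting a few lines would still load only one or two shards, which is the downside noted above.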

derekbruening commented 3 years ago

With larger cache hierarchies and higher associativity (such as simulating a full 2-socket Skylake system) I'm seeing significant time spent walking the ways looking for tags, particularly in invalidate() (this is with coherence turned on as well). I found that inserting a hashtable (if it's initialized to a large enough starting size) results in a 15% speedup for my setup. I'll post the PR.
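As a rough illustration of the idea (not the code from the PR, which is not shown in this thread), the sketch below keeps a hashtable from tag to block index next to the tag array, so lookups and invalidations avoid walking the ways. The class name `tag_index_t`, its methods, and the `initial_buckets` parameter are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sentinel for an empty block; assumes no real tag has this value.
constexpr std::uint64_t kInvalidTag = ~std::uint64_t{0};

// Hypothetical sketch (not drcachesim's classes): a tag array plus a
// hashtable from tag to block index, kept in sync on install/invalidate.
class tag_index_t {
public:
    // initial_buckets pre-sizes the table to avoid rehashing during the run.
    tag_index_t(std::size_t num_blocks, std::size_t initial_buckets)
        : tags_(num_blocks, kInvalidTag)
    {
        tag_to_block_.reserve(initial_buckets);
    }

    // Places tag in block idx, evicting whatever tag was there before.
    void
    install(std::size_t idx, std::uint64_t tag)
    {
        assert(idx < tags_.size());
        auto existing = tag_to_block_.find(tag);
        if (existing != tag_to_block_.end() && existing->second != idx)
            tags_[existing->second] = kInvalidTag; // keep each tag unique
        if (tags_[idx] != kInvalidTag)
            tag_to_block_.erase(tags_[idx]);
        tags_[idx] = tag;
        tag_to_block_[tag] = idx;
    }

    // Returns the block index holding tag, or -1 if absent, without a way walk.
    long
    find(std::uint64_t tag) const
    {
        auto it = tag_to_block_.find(tag);
        return it == tag_to_block_.end() ? -1 : static_cast<long>(it->second);
    }

    // Removes tag if present; returns whether anything was invalidated.
    bool
    invalidate(std::uint64_t tag)
    {
        auto it = tag_to_block_.find(tag);
        if (it == tag_to_block_.end())
            return false;
        tags_[it->second] = kInvalidTag;
        tag_to_block_.erase(it);
        return true;
    }

private:
    std::vector<std::uint64_t> tags_;
    std::unordered_map<std::uint64_t, std::size_t> tag_to_block_;
};
```

The `reserve` call in the constructor mirrors the point about initializing the table to a large enough starting size: it avoids rehashing during the run. Whether this beats a linear walk depends on the associativity, and no speedup figure is claimed for this sketch.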