RRZE-HPC / pycachesim

Python Cache Hierarchy Simulator

Understanding stats: more hits than loads in L1?! #13

Closed. pedroexenberger closed this issue 3 years ago.

pedroexenberger commented 3 years ago

Hi all and thanks for the tool.

I'm feeding pycachesim addresses and sizes to collect cache statistics. I only issue loads, since stores happen during the initialization of my application and are negligible for my initial estimate. I have instantiated a memory hierarchy exactly as in the README:

mem = MainMemory()
l3 = Cache("L3", 20480, 16, 64, "LRU")  # 20MB: 20480 sets, 16-ways with cacheline size of 64 bytes
mem.load_to(l3)
mem.store_from(l3)
l2 = Cache("L2", 512, 8, 64, "LRU", store_to=l3, load_from=l3)  # 256KB
l1 = Cache("L1", 64, 8, 64, "LRU", store_to=l2, load_from=l2)  # 32KB
cs = CacheSimulator(l1, mem)

and then I do a lot of loads:

for line in trace:
    address, n_bytes = parse_line(line)
    cs.load(address, length=n_bytes)
cs.force_write_back()
cs.print_stats()

However, when I print the stats, I'm getting more hits than loads in the L1:

CACHE *******HIT******** *******MISS******* *******LOAD******* ******STORE******* ******EVICT*******
   L1 6774536 (345751301B)  15627 (  812688B) 3815705 (194257152B)      0 (       0B)      0 (       0B)
   L2   8025 (  513600B)   7602 (  486528B)  15627 ( 1000128B)      0 (       0B)      0 (       0B)
   L3    269 (   17216B)   7333 (  469312B)   7602 (  486528B)      0 (       0B)      0 (       0B)
  MEM   7333 (  469312B)      0 (       0B)   7333 (  469312B)      0 (       0B)      0 (       0B)

Also, adding up the bytes from hits and misses does not match the loaded bytes. Are loaded bytes accounted for differently than hits and misses? Am I assuming something here that pycachesim is not able to handle?

Also, n_bytes is often larger than a cache line (64B), because I trace loads of a variable-size structure. Do I have to break them manually into loads of 64B (or less) for the simulator to work correctly, or does it already break them into multiple loads under the hood?

For L2 and higher, the data seems consistent (e.g., hits + misses = loads).

Thank you in advance.

pedroexenberger commented 3 years ago

Here is a minimal reproducible example to illustrate the question (hits > loads):

from cachesim import CacheSimulator, Cache, MainMemory

mem = MainMemory()
l3 = Cache("L3", 20480, 16, 64, "LRU")  # 20MB: 20480 sets, 16-ways with cacheline size of 64 bytes
mem.load_to(l3)
mem.store_from(l3)
l2 = Cache("L2", 512, 8, 64, "LRU", store_to=l3, load_from=l3)  # 256KB
l1 = Cache("L1", 64, 8, 64, "LRU", store_to=l2, load_from=l2)  # 32KB
cs = CacheSimulator(l1, mem)

cs.load(10, length=64)  
cs.load(10, length=64)  
cs.load(10, length=64)  
cs.load(10, length=64)  

cs.force_write_back()
cs.print_stats()

Output:

CACHE *******HIT******** *******MISS******* *******LOAD******* ******STORE******* ******EVICT*******
   L1      6 (     384B)      2 (     128B)      4 (     256B)      0 (       0B)      0 (       0B)
   L2      0 (       0B)      2 (     128B)      2 (     128B)      0 (       0B)      0 (       0B)
   L3      0 (       0B)      2 (     128B)      2 (     128B)      0 (       0B)      0 (       0B)
  MEM      2 (     128B)      0 (       0B)      2 (     128B)      0 (       0B)      0 (       0B)

cod3monk commented 3 years ago

Yes, that is a peculiarity, but it is intended behavior. L1 behaves differently from the other levels, because L1 accepts loads and stores of arbitrary length, while inter-cache communication is restricted to cache-line-sized transfers.

In your minimal example: the first cs.load(10, length=64) is counted as a single LOAD of 64B in L1 and triggers two misses, because the load spans two cache lines (bytes 0-63 and 64-127), neither of which is present in L1. Thus, hits and misses combined are twice the size of the loads, because hits and misses are counted in complete cache lines, while loads and stores (in L1) count only what was requested. The misses are passed on to L2, triggering two 64-byte loads and subsequent misses there, and so on until they reach MEM.

All subsequent loads of the same address and length each count as two hits in L1 (because they span two cache lines) and are not passed on to L2.
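For contrast, here is a minimal sketch of the aligned case, using the same hierarchy as the example above: if a load starts on a cache-line boundary, it touches only one line, so hits plus misses match the load count again.

from cachesim import CacheSimulator, Cache, MainMemory

mem = MainMemory()
l3 = Cache("L3", 20480, 16, 64, "LRU")  # 20MB: 20480 sets, 16-ways with cacheline size of 64 bytes
mem.load_to(l3)
mem.store_from(l3)
l2 = Cache("L2", 512, 8, 64, "LRU", store_to=l3, load_from=l3)  # 256KB
l1 = Cache("L1", 64, 8, 64, "LRU", store_to=l2, load_from=l2)  # 32KB
cs = CacheSimulator(l1, mem)

for _ in range(4):
    cs.load(0, length=64)  # address 0 is line-aligned, so each load touches exactly one line

cs.force_write_back()
cs.print_stats()  # per the explanation above: 1 L1 miss, 3 L1 hits, 4 L1 loads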

Does this answer your question? Reopen or keep asking if it did not :)

pedroexenberger commented 3 years ago

I think the problem is due to a load() that crosses a cache-line boundary. In my example, since I started at address 10, a single load read bytes 10 through 73, crossing the boundary at address 64.

Anyway, I have adapted my simulation to break long loads into 8-byte (64-bit) chunks, 8-byte aligned (adjusting the total size in bytes accordingly). With these modifications, the numbers are consistent.
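A rough sketch of that kind of splitting (split_aligned is just an illustrative helper, not part of pycachesim; trace, parse_line and cs are the same as in the snippet above):

def split_aligned(address, n_bytes, chunk=8):
    # Yield (address, length) pieces that never cross a chunk-aligned boundary.
    end = address + n_bytes
    while address < end:
        boundary = (address // chunk + 1) * chunk  # next alignment boundary after address
        piece = min(boundary, end) - address
        yield address, piece
        address += piece

for line in trace:
    address, n_bytes = parse_line(line)
    for addr, length in split_aligned(address, n_bytes):
        cs.load(addr, length=length)

With chunk=8, a 64-byte load starting at address 10 becomes pieces of 6, then seven times 8, then 2 bytes, none of which crosses a cache-line boundary.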

pedroexenberger commented 3 years ago

Hi @cod3monk, thanks for the answer. I think we posted our comments at almost the same time :)

Yes, I understand your point. But still,

Thus, hits and misses combined are twice the size of the loads, because hits and misses are counted in complete cache lines, while loads and stores (in L1) count only what was requested.

This behavior does not happen if, for example, we ask for a single byte: cs.load(10, length=1) causes a 1B miss (not a 64B miss), so in that case the number of missed bytes equals the requested length. But if the request crosses a line boundary, it counts as two complete lines missed: cs.load(10, length=64) causes a 128B miss (not the requested length, as in the previous case).
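A small self-contained sketch of that counter-example, using the same hierarchy as above (the miss sizes in the comments are the ones described in this comment, not guaranteed output):

from cachesim import CacheSimulator, Cache, MainMemory

mem = MainMemory()
l3 = Cache("L3", 20480, 16, 64, "LRU")
mem.load_to(l3)
mem.store_from(l3)
l2 = Cache("L2", 512, 8, 64, "LRU", store_to=l3, load_from=l3)
l1 = Cache("L1", 64, 8, 64, "LRU", store_to=l2, load_from=l2)
cs = CacheSimulator(l1, mem)

cs.load(10, length=1)     # stays within one line: counted as a 1B miss in L1
cs.load(1000, length=64)  # crosses the line boundary at 1024: counted as a 128B miss in L1

cs.force_write_back()
cs.print_stats()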

I already found a workaround that makes sense for my case, where I make sure a single load never crosses a line boundary. So I'm satisfied with what I have now. But perhaps this counter-example is of some use to you and the project...

Anyway, thanks again for your time!