flaming-cl / 61C


Cache #11

Open flaming-cl opened 3 years ago

flaming-cl commented 3 years ago

Reading & Other Materials

1. reading

https://www.oschina.net/translate/what-every-programmer-should-know-about-cpu-cache-part2?cmp&p=1
https://coolshell.cn/articles/10249.html

2. other

https://www.youtube.com/watch?v=UCK-0fCchmY

TIO (Tag / Index / Offset)


Data loading (with hardware examples)

1. When is it a hit?

1) the valid bit is on, and 2) the tag matches

2. Why use a mux here?

The mux takes the block selected by the index and uses the offset to choose the exact data (word or byte) requested.

3. What kind of locality are we taking advantage of?

Spatial: because the block size is more than just one byte (or word). Temporal: because this is a cache, and a cache keeps the data you use frequently. (A small address-breakdown sketch follows below.)
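To make the TIO breakdown and the hit condition concrete, here is a minimal C sketch. The geometry (a direct-mapped cache with 64 sets and 16-byte blocks) and every name in it are assumptions made up for illustration, not the hardware from the lecture diagram.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE  16   /* bytes per block -> 4 offset bits (assumed) */
#define NUM_SETS    64   /* number of sets  -> 6 index bits (assumed)  */
#define OFFSET_BITS 4
#define INDEX_BITS  6

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_line;

static cache_line cache[NUM_SETS];

/* Hit iff 1) the valid bit is on and 2) the tag matches. */
bool is_hit(uint32_t addr) {
    uint32_t offset = addr & (BLOCK_SIZE - 1);                 /* low 4 bits     */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* next 6 bits    */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* remaining bits */
    (void)offset;  /* on a hit, the offset is what drives the mux picking the word */
    return cache[index].valid && cache[index].tag == tag;
}
```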


Load


Write

Should data be written to the cache or to memory?

write-through

WRITE request -> cache hit -> write to cache block -> then to memory
WRITE request -> cache miss -> write to memory

With write-through, the destination is always memory; when the cache holds the corresponding data, we just write to the cache first to keep it up to date.
Here lies the problem: writing to memory is always somewhat slow (compared with writing straight to the cache).
Write-back is a way of writing data that avoids this.

write-back

Write-back normally updates only the cache; dirty data is synced back to main memory only when the dirty block has to be swapped out. When the cache hits often, this performs better. Dirty: the data in the cache block and in main memory disagree; the cache block has been updated but the corresponding data in main memory has not.
Problem: even though we cut down on main-memory writes and gain efficiency, multi-core CPUs still have a cache coherence problem. To solve it, we introduce a new concept: the MESI protocol.
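A minimal sketch of the two write policies just described, assuming a toy single-level cache and a tiny stand-in for main memory; the structure and all names are made up for illustration, not how a real controller is built.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK 16

static uint8_t main_mem[1024];           /* stand-in for main memory */

typedef struct {
    bool     valid;
    bool     dirty;                      /* write-back only: block differs from memory */
    uint32_t base;                       /* first memory address cached by this block  */
    uint8_t  data[BLOCK];
} line_t;

/* Write-through: every write ends up in memory; a hit also refreshes the cache copy. */
void write_through(line_t *l, uint32_t addr, uint8_t byte) {
    if (l->valid && addr >= l->base && addr < l->base + BLOCK)
        l->data[addr - l->base] = byte;  /* hit: update the cached copy first          */
    main_mem[addr] = byte;               /* ...but the write still goes to memory (slow) */
}

/* Write-back: only the cache is updated; memory catches up at eviction time. */
void write_back_store(line_t *l, uint32_t addr, uint8_t byte) {
    l->data[addr - l->base] = byte;
    l->dirty = true;                     /* cache and memory now disagree ("dirty")    */
}

void evict(line_t *l) {
    if (l->valid && l->dirty)            /* dirty data is synced only when swapped out */
        memcpy(&main_mem[l->base], l->data, BLOCK);
    l->valid = false;
    l->dirty = false;
}
```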

Solving the cache inconsistency problem

Solving this problem requires two preconditions:

1. write propagation

When any CPU core updates its cache, the other CPU cores must be able to learn of the change.

2. transaction serialization

When any CPU core performs a batch of cache updates, the other CPU cores must see those updates in one agreed-upon order. Serializing transactions like this has two prerequisites: 1) write propagation, and 2) when two CPU cores update the same cached data, each must acquire the lock on the corresponding cache block before updating it (convoluted wording, sorry).

Bus Snooping

To solve cache inconsistency we can build a bus that records every data read/write request; each CPU core then goes to the bus itself to snoop on those requests (this approach is commonly used in Intel CPUs).

write invalidate

Invalidate: literally, the data is no longer valid. One CPU core is designated to write the data, and the other cores take their copies from what that core has written.
After that core writes the data into its cache, it broadcasts an invalidate request telling the other cores which memory address's cached copy has become invalid; each core then checks for itself whether it holds a matching cache block and marks it accordingly.

write broadcast

All cores are equal here: whichever core wants to write, every core receives the broadcast write request and updates its own cached copy at the same time. Simple and blunt, and the bluntness has a cost: it takes more bus bandwidth, because write broadcast has to transmit the data itself, one more step than write invalidate.

MESI

M: Modified, E: Exclusive, S: Shared, I: Invalid. In the Exclusive state, the cache line has been loaded only into the current CPU core's cache; no other core has loaded the corresponding data into its own cache. In this case, writes to the exclusive cache block can proceed freely without notifying the other cores.

If a line in the Exclusive state sees a read request for that cache line on the bus, it switches to the Shared state: another CPU core has now loaded the same cache block from memory into its own cache.

In the Shared state the same data exists in several cores' caches, so we cannot modify it directly. We must first broadcast a request asking all the other cores to invalidate their copies of the cache line, and only then update the data in our own cache. This broadcast is usually called an RFO (Request For Ownership): it acquires ownership of the corresponding cache block.
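A rough sketch of the MESI transitions described above for a single cache line. States and events only; a real implementation also handles writing Modified data back on a bus read, bus arbitration, and so on, so treat this as an illustration.

```c
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* This core wants to write the line. */
mesi_t on_local_write(mesi_t s, void (*broadcast_rfo)(void)) {
    if (s == EXCLUSIVE || s == MODIFIED)
        return MODIFIED;      /* sole owner: write freely, nobody to notify  */
    broadcast_rfo();          /* Shared/Invalid: send an RFO so every other  */
    return MODIFIED;          /* core invalidates its copy before we write   */
}

/* Another core's read of this line is seen on the bus. */
mesi_t on_bus_read(mesi_t s) {
    if (s == EXCLUSIVE || s == MODIFIED)
        return SHARED;        /* someone else loaded the block too           */
    return s;                 /* (write-back of Modified data omitted here)  */
}

/* Another core's RFO/invalidate for this line is seen on the bus. */
mesi_t on_bus_rfo(mesi_t s) {
    (void)s;
    return INVALID;           /* our copy is stale now */
}
```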

Block size trade-off

pros

When reading a word, a larger block size may let spatial locality be exploited more fully
good for the stored-program concept / sequential array accesses

cons

larger miss penalty: it takes more time to load in a new block
block size too big relative to the cache size: the miss rate goes up
Extreme example: one single big block

misses (two important factors to consider: is the cache infinite or finite in size? fully associative or of limited associativity?)

  1. compulsory miss
  2. conflict miss: two different memory addresses map to the same cache location; two blocks keep overwriting each other (a big problem in direct-mapped caches)
  3. capacity misses: the cache has a limited size (the primary type of miss for fully associative caches)

fully associative cache

If data from any memory address can be cached in any cache line, we call the cache fully associative. Tags are compared in parallel.
pros: no conflict misses
cons: needs a hardware comparator for every single entry

set associative cache

Computing the size of the set index

log2(number of sets)
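A quick worked example with made-up numbers: for a 4 KiB, 4-way set-associative cache with 16 B blocks, number of sets = 4096 / (16 x 4) = 64, so the set index needs log2(64) = 6 bits; the offset takes log2(16) = 4 bits and, with 32-bit addresses, the tag gets the remaining 32 - 6 - 4 = 22 bits.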

Hardware


block replacement policy

LRU: keeps track of recently read/written data (evicts the data that hasn't been used for a long time); cons: with 4-way or greater set associativity, it requires complicated hardware and much time to keep track of usage
FIFO
Random: low requirements on temporal locality

Average memory access time (AMAT)

How do we choose the associativity / block size / replacement & write policy?
AMAT = L1 Hit time + L1 Miss penalty x L1 Miss Rate
L1 miss penalty = L2 Hit time + L2 Miss penalty x L2 Miss Rate
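A quick worked example with made-up numbers: if the L1 hit time is 1 cycle, the L1 miss rate 5%, the L2 hit time 10 cycles, the L2 miss rate 20%, and a main-memory access 100 cycles, then L1 miss penalty = 10 + 0.20 x 100 = 30 cycles, and AMAT = 1 + 0.05 x 30 = 2.5 cycles.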

1. Improve the miss penalty: add an L2

e.g. data that L1 can't hold spills into L2; when L2 can't hold it either, the slowdown becomes obvious. For examples see 7个示例科普CPU CACHE, which lists L1 and L2 cache sizes (I notice their L3 is about as big as my L2...).

2. reduce miss rate

  1. larger cache
  2. more places in the cache to put each block of memory - associativity

3. typical scale

| cache | size | hit time | miss rate |
| --- | --- | --- | --- |
| L1 | tens of KB | 1 clock cycle | 1-5% |
| L2 | hundreds of KB | a few clock cycles | 10-20% |
flaming-cl commented 3 years ago

Lab7 Cache version SP21

I've already forgotten how to start Venus... so, foolishly, here are the Venus startup steps again (for the new version, 21sp and later):
1) terminal -> ./tools/venus . -dm
2) browser -> https://venus.cs61c.org/
3) on https://venus.cs61c.org/ -> mount local vmfs
4) enter your key

Exercise 1 — Scenario 1

1. What combination of parameters is producing the hit rate you observe?

ans: cache size; step size
why: On each loop iteration the CPU doesn't fetch just a single value; it fetches a whole block at a time (the cache block, which here is the same as the cache size). With the scenario's parameters the cache size is 8 and the step size is also 8, with direct mapping, so the data fetched on each pass fills the whole cache without ever being reused; the caching is wasted and the hit rate stays at 0.

2. What is our hit rate if we increase Rep Count arbitrarily?

ans: 0.0
why: Increasing the rep count doesn't change anything; it just repeats the same stride-8 access pattern a few more times.

3. How could we modify one program parameter to get the highest possible hit rate?

ans: array size -> 1
why: Changing the step size to 1 would raise the hit rate to 50%; but why not cheat and set the array size to 1, which pushes the hit rate to 75%? After all, there are then only 4 (rep count) reads of the same element: the first is a compulsory miss, and the remaining 3 hit.

Exercise 1 — Scenario 2

Here the advantage of an N-way set-associative cache shows itself: the hit rate is 75% (matching the cheat from the previous scenario!).

1. How many memory accesses are there per iteration of the inner loop (not the one involving Rep Count)?

ans: 2
why: the option is set to 1 here, so each iteration does one read and one write.

2. What is the repeating hit/miss pattern? Write your answer in the form “MMHHMH” and so on, where your response is the shortest pattern that gets repeated.

ans MHHH

3. Keeping everything else the same, what does our hit rate approach as Rep Count goes to infinity? Try it out by changing the appropriate program parameter and letting the code run! Write your answer as a decimal.

ans: the hit rate approaches 1.00 (100%)
why: You don't even need an infinite loop; with a rep count of only about 30 the hit rate is already 99%. As the rep count keeps growing, the repeated passes keep finding the data already cached, so hits keep accumulating.

Exercise 1 — Scenario 3

1. What is the hit rate of our L1 cache? Our L2 cache? Overall? Write your answer in the form “[L1 HR], [L2 HR], [Overall HR]” where each hit rate is a decimal rounded to two places.

ans: L1 HR 0.5, L2 HR 0, Overall HR 0.33
why: Venus reports the L1 and L2 rates directly; the overall rate is hits(L1 + L2) / accesses(L1 + L2).

2. How many accesses do we have to the L1 cache total? How many of them are misses?

ans L1 accesses 32, L1 misses 16

3. How many accesses do we have to the L2 cache total?

ans: L2 accesses 16
why: every L1 miss goes on to L2, so there are 16.

4. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same?

ans: rep count
why: Increase the rep count so there are more passes, e.g. with rep count 10, L1 HR is 0.5 and L2 HR is 0.9.

5. Do our L1 and L2 hit rates decrease (-), stay the same (=), or increase (+) as we (1) increase the number of blocks in L1, or (2) increase the L1 block size?

ans +number of blocks in L1 —> L1 0.5, L2 0
ans +the L1 block size —> L1 0.5, L2 0.5

Exercise 2

  1. jki, kji
  2. kij, ikj

Results from make ex2 (a loop-order sketch follows below):

ijk: n = 1000, 1.554 Gflop/s
ikj: n = 1000, 1.362 Gflop/s
jik: n = 1000, 1.591 Gflop/s
jki: n = 1000, 10.160 Gflop/s
kij: n = 1000, 1.092 Gflop/s
kji: n = 1000, 7.567 Gflop/s

https://stackoverflow.com/questions/41374646/loop-stride-and-cache-line
Judging by these results, a large stride in the inner loop really does hurt performance.
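To make the stride point concrete, here is a sketch of the jki ordering using the column-major indexing C[i+j*n] += A[i+k*n]*B[k+j*n] quoted in a later comment; it illustrates the idea and is not the lab's actual source.

```c
/* jki ordering: the innermost loop walks i, so C[i + j*n] and A[i + k*n] are
 * both touched with stride 1 (consecutive addresses), which is why this
 * ordering comes out fastest in the numbers above. */
void dgemm_jki(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k + j * n];         /* loop-invariant for the inner loop */
            for (int i = 0; i < n; i++)      /* stride-1 over both C and A        */
                C[i + j * n] += A[i + k * n] * b;
        }
}
```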

Exercise 3

It looks like proj4 asks us to implement a bare-bones numpy, which means handling matrices, so getting cache blocking straight here will also help with that project. Cache blocking improves performance mainly because the cache reads data block by block. Extra material: https://www.youtube.com/watch?v=G92BCtfTwOE. Reading a matrix column by column gives you a perfect stream of cache misses, one after another; one fix is to transpose the second matrix, swapping its rows and columns.

cache blocking

The two inner levels of iteration in the blocked version loop over one small block; a rough sketch follows below.
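For reference, a cache-blocked transpose might look roughly like this; row-major layout, and the name and exact signature are my own, so the lab's version may differ in details.

```c
/* Transpose an n x n matrix in blocksize x blocksize tiles, so the rows being
 * read and the columns being written both stay resident in the cache while a
 * single tile is processed. */
void transpose_blocked(int n, int blocksize, const int *src, int *dst) {
    for (int ii = 0; ii < n; ii += blocksize)        /* the two outer loops pick a tile */
        for (int jj = 0; jj < n; jj += blocksize)
            for (int i = ii; i < ii + blocksize && i < n; i++)       /* the two inner    */
                for (int j = jj; j < jj + blocksize && j < n; j++)   /* loops walk       */
                    dst[j + i * n] = src[i + j * n];                 /* inside the tile  */
}
```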

P1 output

transpose 100 20
Testing naive transpose: 0.007 milliseconds
Testing transpose with blocking: 0.011 milliseconds
transpose 1000 20
Testing naive transpose: 1.302 milliseconds
Testing transpose with blocking: 0.672 milliseconds
transpose 2000 20
Testing naive transpose: 9.193 milliseconds
Testing transpose with blocking: 2.864 milliseconds
transpose 5000 20
Testing naive transpose: 129.459 milliseconds
Testing transpose with blocking: 20.346 milliseconds
transpose 10000 20
Testing naive transpose: 732.465 milliseconds
Testing transpose with blocking: 99.772 milliseconds

1. At what point does cache blocked version of transpose become faster than the non-cache blocked version?

You can see that when n is small the benefit of cache blocking doesn't show, but around n = 1000 the story changes: transpose with blocking is roughly twice as fast as the non-blocked version.

2. Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

My guess: when n is very small the whole computation fits in L1 and is fast anyway, so blocking doesn't make much of a difference.

P2 output

transpose 10000 50
Testing naive transpose: 816.74 milliseconds
Testing transpose with blocking: 105.898 milliseconds
transpose 10000 100
Testing naive transpose: 782.371 milliseconds
Testing transpose with blocking: 100.63 milliseconds
transpose 10000 500
Testing naive transpose: 790.1 milliseconds
Testing transpose with blocking: 76.035 milliseconds
transpose 10000 1000
Testing naive transpose: 784.74 milliseconds
Testing transpose with blocking: 102.683 milliseconds
transpose 10000 5000
Testing naive transpose: 787.676 milliseconds
Testing transpose with blocking: 720.461 milliseconds

3. How does performance change as block size increases? Why is this the case?

As the block size increases, performance is U-shaped: past a point, making the block size ever larger makes performance drop again. The reason, quoting the earlier notes: a larger block size means a larger miss penalty (it also takes more time to load in a new block).

sgzerolc commented 3 years ago

exercise1

Suppose we have a program that iterates through a very large array (i.e. way bigger than the size of the cache) Rep Count times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). For reference, in this scenario, we just had one function (incrementation) and one Rep.

checkoff Q5: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, i.e. it doesn’t matter if Rep k is applied to element arr[i+1] before Rep k is applied to element arr[i], etc.

we should try to access __ of the array at a time and apply all of the _ to that __ so we can be completely done with it before moving on, thereby keeping that _ hot in the cache and not having to circle back to it later on!

We have a very large array and apply a different modification to each element on every Rep, and we want a hit rate like the one in the N-way set-associative scenario: access the array's elements one at a time, apply the changes to each, and only then move on to the next element.

task5: +the L1 block size —> L1 0.75, L2 0

exercise2

Results:

ijk:    n = 1000, 2.178 Gflop/s
ikj:    n = 1000, 4.302 Gflop/s
jik:    n = 1000, 2.261 Gflop/s
jki:    n = 1000, 24.138 Gflop/s
kij:    n = 1000, 3.326 Gflop/s
kji:    n = 1000, 16.772 Gflop/s

best: [jki], [kji] worst: [ijk], [jik]
Reason: the larger the stride of the inner loop on each iteration, the slower it runs. But I don't quite understand why the stored elements get evicted before they are used, causing redundant re-reads when they are needed again?

As it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.

https://stackoverflow.com/questions/34189896/cache-performance-concerning-loops-in-c C[i+j*n] += A[i+k*n]*B[k+j*n];

exercise3

transpose 100 20
Testing naive transpose: 0.013 milliseconds
Testing transpose with blocking: 0.016 milliseconds
transpose 1000 20
Testing naive transpose: 1.164 milliseconds
Testing transpose with blocking: 0.828 milliseconds
transpose 2000 20
Testing naive transpose: 2.44 milliseconds
Testing transpose with blocking: 2.642 milliseconds
transpose 5000 20
Testing naive transpose: 84.874 milliseconds
Testing transpose with blocking: 15.78 milliseconds
transpose 10000 20
Testing naive transpose: 299.992 milliseconds
Testing transpose with blocking: 86.597 milliseconds

transpose 10000 20
Testing naive transpose: 299.992 milliseconds
Testing transpose with blocking: 86.597 milliseconds
transpose 10000 50
Testing naive transpose: 304.823 milliseconds
Testing transpose with blocking: 77.366 milliseconds
transpose 10000 100
Testing naive transpose: 334.452 milliseconds
Testing transpose with blocking: 106.224 milliseconds
transpose 10000 500
Testing naive transpose: 435.194 milliseconds
Testing transpose with blocking: 91.201 milliseconds
transpose 10000 1000
Testing naive transpose: 436.911 milliseconds
Testing transpose with blocking: 74.681 milliseconds
transpose 10000 5000
Testing naive transpose: 310.619 milliseconds
Testing transpose with blocking: 408.995 milliseconds

flaming-cl commented 3 years ago

exercise1

  • Scenario 1 task 3, checkoff Q2: There are two program parameters you could’ve changed—can you list both? Rep count and number of blocks. On the first pass, the rep count lets the cache fill up with the most-needed addresses; later passes re-read them, increasing the number of cache hits. The number of blocks increases the cache size.
  • sce2

Suppose we have a program that iterates through a very large array (i.e. way bigger than the size of the cache) Rep Count times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). For reference, in this scenario, we just had one function (incrementation) and one Rep.

checkoff Q5: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, i.e. it doesn’t matter if Rep k is applied to element arr[i+1] before Rep k is applied to element arr[i], etc.

we should try to access __ of the array at a time and apply all of the _ to that __ so we can be completely done with it before moving on, thereby keeping that _ hot in the cache and not having to circle back to it later on!

We have a very large array and apply a different modification to each element on every Rep, and we want a hit rate like the one in the N-way set-associative scenario: access the array's elements one at a time, apply the changes to each, and only then move on to the next element.

  • sce3 task4, checkoff Q6: What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why? Rep count.
  • [ ] I'm not quite sure about Q5.
  • [ ] Question: with a two-level cache, L1 sits in front of L2 and L2 in front of main memory. While reading data we walk the array, the addresses keep getting brought into L2, and L1 fills up with addresses that L2 also caches. Do later accesses to the array then all count as L1 cache hits? Is this a fair way to describe how the cache reads data? I may be missing or misremembering some details.

task5: +the L1 block size —> L1 0.75, L2 0

exercise2

Results:

ijk:    n = 1000, 2.178 Gflop/s
ikj:    n = 1000, 4.302 Gflop/s
jik:    n = 1000, 2.261 Gflop/s
jki:    n = 1000, 24.138 Gflop/s
kij:    n = 1000, 3.326 Gflop/s
kji:    n = 1000, 16.772 Gflop/s

best: [jki], [kji] worst: [ijk], [jik]
Reason: the larger the stride of the inner loop on each iteration, the slower it runs. But I don't quite understand why the stored elements get evicted before they are used, causing redundant re-reads when they are needed again?

As it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.

https://stackoverflow.com/questions/34189896/cache-performance-concerning-loops-in-c C[i+j*n] += A[i+k*n]*B[k+j*n];

exercise3

transpose 100 20
Testing naive transpose: 0.013 milliseconds
Testing transpose with blocking: 0.016 milliseconds
transpose 1000 20
Testing naive transpose: 1.164 milliseconds
Testing transpose with blocking: 0.828 milliseconds
transpose 2000 20
Testing naive transpose: 2.44 milliseconds
Testing transpose with blocking: 2.642 milliseconds
transpose 5000 20
Testing naive transpose: 84.874 milliseconds
Testing transpose with blocking: 15.78 milliseconds
transpose 10000 20
Testing naive transpose: 299.992 milliseconds
Testing transpose with blocking: 86.597 milliseconds

  • [ ] ques1&2: with the block size unchanged, cache blocking only pulls ahead once the matrix is large enough. I think it's because the same addresses get reused more often, which raises the hit rate.

transpose 10000 20
Testing naive transpose: 299.992 milliseconds
Testing transpose with blocking: 86.597 milliseconds
transpose 10000 50
Testing naive transpose: 304.823 milliseconds
Testing transpose with blocking: 77.366 milliseconds
transpose 10000 100
Testing naive transpose: 334.452 milliseconds
Testing transpose with blocking: 106.224 milliseconds
transpose 10000 500
Testing naive transpose: 435.194 milliseconds
Testing transpose with blocking: 91.201 milliseconds
transpose 10000 1000
Testing naive transpose: 436.911 milliseconds
Testing transpose with blocking: 74.681 milliseconds
transpose 10000 5000
Testing naive transpose: 310.619 milliseconds
Testing transpose with blocking: 408.995 milliseconds

  • [ ] ques3: with the matrix size unchanged, increasing the block size past 1000 makes cache blocking perform worse.

This article is worth reading alongside these questions: https://zhuanlan.zhihu.com/p/102293437

sgzerolc commented 3 years ago

I had misunderstood the multi-level cache hierarchy, plus a few other details. After reading that article I realize I still need to go back to the book before I can answer and summarize these questions properly.