flaming-cl opened this issue 3 years ago
I can't remember how to launch Venus any more... so here's a dumb repeat of the Venus startup steps (for the newer, 21sp-and-later setup):
1. terminal -> ./tools/venus . -dm
2. browser -> https://venus.cs61c.org/
3. on https://venus.cs61c.org/ -> mount local vmfs
4. enter your key
ans | cache size; step size |
---|---|
why | On each loop iteration, the CPU doesn't fetch just one number; it fetches data block by block (a cache block, a.k.a. the cache size here). With the given parameters, the cache size is 8 and the step size is also 8, and the mapping is direct-mapped, so the group of numbers fetched each iteration fills the whole cache. The caching is wasted and the hit rate stays at 0. |
ans | 0.0 |
---|---|
why | Increasing the rep count doesn't change anything; it just repeats the "fetch at a stride of 8" pattern a few more times. |
ans | array size -> 1 |
---|---|
why | Here, changing the step size to 1 would raise the hit rate to 50%; but why not game it and set the array size to 1, which pushes the hit rate up to 75%? After all, that leaves only 4 (rep count) reads of the same number: the first is a compulsory miss, and I still hit the other 3, hahaha. |
The advantage of an N-way set-associative cache finally shows here: the hit rate is 75% (tying the gamed result from the previous question!).
ans | 2 |
---|---|
why | With the option changed to 1, each element is read once and written once. |
ans | MHHH |
---|---|
ans | the hit rate keeps approaching 100% |
---|---|
why | Never mind an infinite loop: even with a Rep Count of only around 30, the hit rate is already about 99%. As the rep count keeps growing, the repeated passes keep the data cached piece by piece, so the hits keep piling up. |
ans | L1 HR 0.5, L2 HR 0, Overall HR 0.33 |
---|---|
why | Venus reports the L1 and L2 rates directly; the overall rate is hits(L1 + L2) / accesses(L1 + L2). |
ans | L1 accesses 32, L1 misses 16 |
---|---|
ans | L2 accesses 16 |
---|---|
why | Whatever misses in L1 goes on to L2, hence the 16 accesses. |
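Plugging in the numbers above: L1 hits = 32 - 16 = 16, and since the L2 hit rate is 0, L2 hits = 0, so overall HR = (16 + 0) / (32 + 16) = 16/48 ≈ 0.33, matching what Venus reports.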
ans | rep count |
---|---|
why | Increase the rep count to get more iterations: e.g. with a rep count of 10, L1 HR is 0.5 and L2 HR is 0.9. |
ans | +number of blocks in L1 —> L1 0.5, L2 0 |
---|---|
ans | +the L1 block size —> L1 0.5, L2 0.5 |
Apparently proj4 asks us to implement a poor man's numpy, so we'll definitely have to handle matrices; getting cache blocking straight here will also help with that assignment later. Cache blocking improves performance mainly because the cache reads data block by block. Extra material: https://www.youtube.com/watch?v=G92BCtfTwOE As the figure shows, reading a matrix column by column is a perfect way to cache-miss over and over again. We can transpose the second matrix, swapping its rows and columns, as shown in the figure.
The inner two loop levels iterate over the small blocks (see the sketch below).
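Here is a minimal sketch of what the blocked transpose looks like, assuming a row-major n×n int matrix; the function names and signatures are illustrative, not necessarily the starter code's exact ones:

```c
// Naive transpose: dst is walked with stride 1 in the inner loop,
// but src is walked down a column with stride n, so it misses constantly for large n.
void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];
}

// Cache-blocked transpose: the outer two loops walk blocksize x blocksize tiles;
// the inner two loops transpose one tile while it is still hot in the cache.
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            for (int x = i; x < i + blocksize && x < n; x++)
                for (int y = j; y < j + blocksize && y < n; y++)
                    dst[y + x * n] = src[x + y * n];
}
```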
transpose 100 20
Testing naive transpose: 0.007 milliseconds
Testing transpose with blocking: 0.011 milliseconds
transpose 1000 20
Testing naive transpose: 1.302 milliseconds
Testing transpose with blocking: 0.672 milliseconds
transpose 2000 20
Testing naive transpose: 9.193 milliseconds
Testing transpose with blocking: 2.864 milliseconds
transpose 5000 20
Testing naive transpose: 129.459 milliseconds
Testing transpose with blocking: 20.346 milliseconds
transpose 10000 20
Testing naive transpose: 732.465 milliseconds
Testing transpose with blocking: 99.772 milliseconds
As you can see, when n is small the advantage of cache blocking doesn't show, but once n is around 1000 the story changes: transpose with blocking is roughly twice as fast as the unblocked version.
My guess: when n is very small, the whole computation fits inside L1 and is fast anyway, so blocking for the cache doesn't make much difference.
transpose 10000 50
Testing naive transpose: 816.74 milliseconds
Testing transpose with blocking: 105.898 milliseconds
transpose 10000 100
Testing naive transpose: 782.371 milliseconds
Testing transpose with blocking: 100.63 milliseconds
transpose 10000 500
Testing naive transpose: 790.1 milliseconds
Testing transpose with blocking: 76.035 milliseconds
transpose 10000 1000
Testing naive transpose: 784.74 milliseconds
Testing transpose with blocking: 102.683 milliseconds
transpose 10000 5000
Testing naive transpose: 787.676 milliseconds
Testing transpose with blocking: 720.461 milliseconds
Increasing the block size, the performance curve is U-shaped: as the block size keeps growing, performance drops again. The reason, quoting the earlier notes: larger block size -> larger miss penalty (it also takes more time to reload a new block).
exercise1
- scenario1 task3 checkoff Q2: There are two program parameters you could’ve changed—can you list both? rep count and number of blocks. On the first rep, the cache fills up with the addresses we need most; the later reps re-read them, which increases the number of cache hits. Number of blocks increases the cache size.
- sce2
Suppose we have a program that iterates through a very large array (i.e. way bigger than the size of the cache) Rep Count times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). For reference, in this scenario, we just had one function (incrementation) and one Rep.
checkoff Q5: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, i.e. it doesn’t matter if Rep k is applied to element arr[i+1] before Rep k is applied to element arr[i], etc.
we should try to access __ of the array at a time and apply all of the _ to that __ so we can be completely done with it before moving on, thereby keeping that _ hot in the cache and not having to circle back to it later on!
We have a very large array and apply a different modification to each of its elements, and we want a hit rate comparable to the N-way set-associative scenario. Access the array one element at a time, apply the changes to it, and only then move on to the next element.
- sce3 task4 checkoff Q6: What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same, and why? rep count.
- [ ] I'm not really sure about Q5
- [ ] Question: in a two-level cache, L1 controls L2 and L2 controls main memory. When reading data, as we walk through the array the addresses keep getting written into L2, and L1 fills up with addresses that are in L2. Do later accesses to the array then all count as cache hits for L1? Is that a valid way to describe how a cache reads data? I may be missing or misremembering some details.
task5: +the L1 block size —> L1 0.75, L2 0
exercise2
Results:
ijk: n = 1000, 2.178 Gflop/s
ikj: n = 1000, 4.302 Gflop/s
jik: n = 1000, 2.261 Gflop/s
jki: n = 1000, 24.138 Gflop/s
kij: n = 1000, 3.326 Gflop/s
kji: n = 1000, 16.772 Gflop/s
best: [jki], [kji] worst: [ijk], [jik] Reason: the longer the stride on each pass of the innermost loop, the slower it runs. But I don't quite get why the stored elements are evicted before they are used, causing redundant re-reads when they are needed again?
As it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.
https://stackoverflow.com/questions/34189896/cache-performance-concerning-loops-in-c
C[i+j*n] += A[i+k*n]*B[k+j*n];
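As a sketch, assuming n×n matrices stored so that element (i, j) lives at index i + j*n (the layout implied by the statement above), the fastest ordering jki looks like this; with i innermost, C[i + j*n] and A[i + k*n] are walked with stride 1 and B[k + j*n] stays fixed, whereas the slow orderings (ijk, jik) put k innermost and turn A's accesses into stride-n jumps:

```c
// jki ordering: unit-stride accesses to C and A in the innermost loop.
void matmul_jki(int n, double *A, double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```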
exercise3
transpose 100 20
Testing naive transpose: 0.013 milliseconds
Testing transpose with blocking: 0.016 milliseconds
transpose 1000 20
Testing naive transpose: 1.164 milliseconds
Testing transpose with blocking: 0.828 milliseconds
transpose 2000 20
Testing naive transpose: 2.44 milliseconds
Testing transpose with blocking: 2.642 milliseconds
transpose 5000 20
Testing naive transpose: 84.874 milliseconds
Testing transpose with blocking: 15.78 milliseconds
transpose 10000 20
Testing naive transpose: 299.992 milliseconds
Testing transpose with blocking: 86.597 milliseconds
- [ ] ques1&2: with the blocksize fixed, cache blocking only performs better once the matrix is large enough. I think that's because the same addresses get reused more often, which raises the hit rate.
transpose 10000 20
Testing naive transpose: 299.992 milliseconds
Testing transpose with blocking: 86.597 milliseconds
transpose 10000 50
Testing naive transpose: 304.823 milliseconds
Testing transpose with blocking: 77.366 milliseconds
transpose 10000 100
Testing naive transpose: 334.452 milliseconds
Testing transpose with blocking: 106.224 milliseconds
transpose 10000 500
Testing naive transpose: 435.194 milliseconds
Testing transpose with blocking: 91.201 milliseconds
transpose 10000 1000
Testing naive transpose: 436.911 milliseconds
Testing transpose with blocking: 74.681 milliseconds
transpose 10000 5000
Testing naive transpose: 310.619 milliseconds
Testing transpose with blocking: 408.995 milliseconds
- [ ] ques3: with the matrix size fixed, increasing the blocksize makes cache blocking perform worse once the blocksize passes about 1000.
These questions pair well with this article: https://zhuanlan.zhihu.com/p/102293437
I had misunderstood the multi-level cache structure, plus some details. After reading that article I realize I still need to go back to the book before I can answer and summarize these questions properly.
Reading & Other Materials
1. reading
https://www.oschina.net/translate/what-every-programmer-should-know-about-cpu-cache-part2?cmp&p=1 https://coolshell.cn/articles/10249.html
2. other
https://www.youtube.com/watch?v=UCK-0fCchmY
TIO
Data loading (with examples in hardware)
1. When is it a hit?
1) the valid bit is on, and 2) the tag matches
2. Why use a mux here?
It corresponds to the block index and then selects the exact data from that block.
3. What kind of locality are we taking advantage of?
Spatial: because the block size is more than just one byte (or word). Temporal: because this is a cache (a cache stores the data you use frequently).
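A quick TIO breakdown with made-up parameters (not the lab's): a direct-mapped cache with 1 KiB of data, 16-byte blocks, and 32-bit addresses has offset = log2(16) = 4 bits, index = log2(1024 / 16) = log2(64) = 6 bits, and tag = 32 - 4 - 6 = 22 bits.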
Load
Write
Should data be written to the cache or to memory?
write-through
With write-through, the write always ends up in memory; it's just that when the cache holds the corresponding data, we first write to the cache to update it.
The problem: writing data to memory is always a bit slow (compared with writing straight into the cache).
Here we can use write-back to write the data instead.
write-back
Write-back normally updates only the cache, and only syncs data to main memory when a dirty block has to be evicted from the cache. When the cache hits frequently, this performs better. Dirty: the data in a cache block differs from the data in main memory, i.e. the cache block has been updated but the corresponding data in main memory has not.
Problem: even though we've cut down on main-memory writes and improved efficiency, a multi-core CPU still has a cache coherence problem. To solve it we need a new concept: the MESI protocol.
Solving cache incoherence
Solving this requires two preconditions:
1. write propagation
When any CPU core updates its cache, the other CPU cores must be able to learn of the change.
2. transaction serialization
When any CPU core performs a batch of cache updates, the other CPU cores must see that batch applied in the same order. Serializing transactions this way has two prerequisites: 1) write propagation, and 2) for the same cached data in two CPU cores, a core only performs the update after it has acquired the lock on the corresponding cache block (sorry, that's a mouthful).
Bus Snooping
To solve cache incoherence we can build a bus that records every data read/write request; each CPU core then goes to the bus itself to check those requests (this is the approach Intel CPUs commonly use).
write invalidate
invalidate: literally, the data is no longer valid. One CPU core is designated to write the data, and the other CPU cores copy what that core has written.
After this core writes the data into its cache, it broadcasts an invalidate request telling the other cores which memory address's cached copy is now stale. Each of the other cores checks for itself whether it holds a corresponding cache block to invalidate, and marks it on its own.
write broadcast
Everyone is equal: with multiple CPU cores, no matter which one wants to write, all the cores receive the broadcast write request and update their own cached copies at the same time. Simple and brute-force, and the brute force comes at a cost: it uses more bus bandwidth, since write broadcast has to ship the data contents on top of what write invalidate sends.
MESI
M: modified, E: exclusive, S: shared, I: invalidated. In the exclusive state, the cache line is loaded only into the cache of the current CPU core; no other CPU core has loaded the corresponding data into its cache. In that case, if we want to write to the exclusive cache block, we can write freely without telling the other CPU cores.
If data in the exclusive state sees a read request for that cache line arrive over the bus, it switches to the shared state, because at that point another CPU core has also loaded the corresponding cache block from memory into its own cache.
In the shared state, the same data sits in the caches of several CPU cores, so when we want to update the data in our cache we can't modify it directly. We first have to broadcast a request to all the other CPU cores asking them to invalidate their copies of the cache block, and only then update the data in our own cache. This broadcast is usually called an RFO (Request For Ownership), i.e. acquiring ownership of the data in that cache block.
Block size trade-off
pros
When reading a word, a larger block size may let us exploit spatial locality more fully.
Good for the stored-program concept / sequential array accesses.
cons
larger miss penalty: it takes more time to load a new block; and if the block size is too big relative to the cache size, the miss rate goes up (made-up illustration below). Extreme e.g.: one big block.
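A made-up illustration of that second point: a 64-byte cache with 8-byte blocks holds 8 blocks, but with 32-byte blocks it holds only 2, so a few addresses that happen to map to the same lines keep evicting each other and the miss rate climbs even though the working set hasn't changed.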
misses (two important factors to consider: infinite vs. finite cache size; fully associative vs. limited associativity)
fully associative cache
If the data from any memory address can be cached in any cache line, we call the cache fully associative. Tags are compared in parallel. pros: no conflict misses. cons: needs a hardware comparator for every single entry.
set associative cache
Computing the size of the set index
log2(number of sets)
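For example (made-up numbers): a 4-way set-associative cache with 32 blocks has 32 / 4 = 8 sets, so the set index is log2(8) = 3 bits.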
Hardware
block replacement policy
LRU: keep track of the most recently read/written data (evict cached data that hasn't been used for a long time); cons: with 4-way or greater set assoc, requires complicated hardware and much time to keep track of it (an example of executing the LRU caching policy)
FIFO
Random: low requirements on temporal locality
average memory access time (AMAT)
How do we choose associativity / block size / replacement & write policy? AMAT = L1 hit time + L1 miss rate x L1 miss penalty, where L1 miss penalty = L2 hit time + L2 miss rate x L2 miss penalty
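A quick check with made-up numbers: if the L1 hit time is 1 cycle, the L1 miss rate 10%, the L2 hit time 10 cycles, the L2 miss rate 5%, and the L2 miss penalty 100 cycles, then L1 miss penalty = 10 + 0.05 x 100 = 15 cycles and AMAT = 1 + 0.10 x 15 = 2.5 cycles.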
1. improve the miss penalty: add an L2
e.g. data that L1 can't hold goes into L2; when even L2 can't hold it, you can clearly see the speed drop. For an example see "7个示例科普CPU CACHE" (the L1 and L2 cache size example; I notice their L3 is about as big as my L2...)
2. reduce miss rate
3. typical scale