Closed paulmetzger closed 3 years ago
Yes @paulmetzger, I'd like to merge this in. Before I try it myself, I have a few tweaks in mind:
Can we have different input sizes for simulation and hardware, like here? In simulation, we want a small input for testing purposes, and on hardware we want a medium size to get a meaningful IPC number.
Can we run it on the FPGA, check that it works for larger sizes, and get the IPC? I can explain how to access an FPGA remotely on Slack. I suspect the IPC won't be very good due to lines 48-49, which index global memory in a non-aligned way -- a limitation of the current coalescing unit. Therefore, can we rename it to "StencilSlow"?
If you're up for it, a "StencilFast" version would be nice in future, which loads neighbourhoods into shared local memory and only does aligned access to global memory (this used to be the required way to write kernels to get decent performance, but may not be needed in modern GPUs which have more elaborate coalescing strategies).
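To make the distinction concrete, here is a small sequential sketch of the access pattern behind the "StencilSlow" naming: each output element reads its neighbours directly, so the neighbour reads land at addresses offset by one from the thread index and cannot all be coalesced. This is an illustrative model, not the actual kernel; the function name and zero-boundary handling are my own choices.

```cpp
#include <cassert>
#include <vector>

// Illustrative model of the "slow" 3-point stencil access pattern:
// in[i-1] and in[i+1] are the off-by-one (non-aligned) neighbour
// reads that go straight to global memory in the real kernel.
std::vector<float> stencilSlow(const std::vector<float>& in) {
    int n = (int)in.size();
    std::vector<float> out(n, 0.0f);
    for (int i = 0; i < n; i++) {
        float l = (i > 0)     ? in[i - 1] : 0.0f;  // unaligned read
        float r = (i + 1 < n) ? in[i + 1] : 0.0f;  // unaligned read
        out[i] = (l + in[i] + r) / 3.0f;
    }
    return out;
}
```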
Hi @mn416 , I renamed the application and added a check for whether we are in a simulator or on an FPGA. The Rodinia benchmark that I planned to implement is a 2D stencil kernel and uses local memory, but it requires FP support, as I mentioned.
I am not sure you can get around the non-aligned accesses in a stencil kernel because of the neighbourhood access patterns that are typical of stencils. If you compute NxN blocks then you access overlapping (N+2)x(N+2) blocks, assuming a neighbourhood size of 1. I just tried the special case (N+2) == alignment_requirement (which is the number of SIMT lanes?) with some drawings and came to the conclusion that some of the accesses would still be unaligned. Are the overheads of capabilities affected by whether memory accesses are aligned or not?
@paulmetzger Thanks for the tweaks, let's revisit when we've tried it on FPGA.
I think a solution with completely aligned accesses is doable. Suppose we have a single warp of 32 threads computing a single row of the output, moving from left to right. To compute the outputs at `i .. i+31` (aligned with the thread indices in the warp), we need the inputs at `i-32 .. i-1` (to the left, also aligned), `i .. i+31` (middle, aligned), and `i+32 .. i+63` (to the right, also aligned). As an improvement, we can avoid repeated loading of the same data by shifting left/middle/right as we go, i.e. `left := middle`, `middle := right`, and `right := (load next block of 32)`. Does that make any sense? Multiple warps can then be used to compute multiple rows in parallel. There may be some repeated loading between warps (but not within a warp).
It is true that none of this results in different overheads for CHERI v non-CHERI. The efficient version is just nice to show that the hardware prototype is capable of doing something good!
To clarify, I mean the left/middle/right are kept in shared local mem and accessed unaligned (which is fine).
@mn416 , yes, you are right. For some reason I only thought of implementations that add boundary values to the input buffers. For example, if you have a 16x16 input for a stencil with a radius of one, these implementations would use an 18x18 input buffer. The first and last rows and columns would contain the boundary values required to compute elements at the edge of the 16x16 grid. One could instead add checks to the kernel and execute different code for values at the boundaries. The downside is that these checks add overhead, but it will likely be amortised by the gains from aligned memory accesses.
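For illustration, here is a minimal sequential sketch of the boundary-check alternative just described, assuming a 16x16 grid, a radius-1 five-point stencil, and out-of-range neighbours treated as zero (all names and the averaging kernel are my own, not from the actual application):

```cpp
#include <cassert>
#include <vector>

constexpr int N = 16;  // 16x16 grid, radius-1 stencil

// Boundary-check variant: no padded 18x18 halo buffer; instead each
// neighbour read tests whether it falls inside the grid.
std::vector<float> stencil2D(const std::vector<float>& in) {
    std::vector<float> out(N * N, 0.0f);
    auto at = [&](int r, int c) {
        // These range checks are the overhead traded for dropping
        // the 18x18 input buffer with explicit boundary values.
        return (r >= 0 && r < N && c >= 0 && c < N) ? in[r * N + c] : 0.0f;
    };
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            out[r * N + c] = (at(r, c) + at(r - 1, c) + at(r + 1, c)
                              + at(r, c - 1) + at(r, c + 1)) / 5.0f;
    return out;
}
```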
@paulmetzger I'm going to merge this into a new branch `pffm2-stencil-slow`, as it sounds like you have a more efficient version now that would be nicer to have in `master`. I don't want to lose this version though, as it will be a useful test if/when we make improvements to the mem subsystem. BTW, feel free to create your own branches in this repo.
@mn416 OK great!
Hi Matt,
I created a simple 2D stencil application, mostly for my own education. I don't know if it is interesting enough for you to add, but I thought I'd let you decide. Happy to make changes or discuss it.
Cheers, Paul