CTSRD-CHERI / SIMTight

Synthesisable SIMT-style RISC-V GPGPU

Added a simple 2D stencil application #4

Closed paulmetzger closed 3 years ago

paulmetzger commented 3 years ago

Hi Matt,

I created a simple 2D stencil application, mostly for my own education. I don't know if it is interesting enough for you to add, but I thought I'd let you decide. Happy to make changes or discuss it.

Cheers, Paul

mn416 commented 3 years ago

Yes @paulmetzger, I'd like to merge this in. Before I try it myself, I have a few tweaks in mind:

  1. Can we have different input sizes for simulation and hardware, like here? In simulation, we want a small input for testing purposes, and on hardware we want a medium size to get a meaningful IPC number.

  2. Can we run it on FPGA, check it works for larger sizes, and get the IPC? I can explain how to access an FPGA remotely on slack. I suspect IPC won't be very good due to lines 48-49 which index global memory in a non-aligned way -- a limitation of the current coalescing unit. Therefore, can we rename it to "StencilSlow"?

If you're up for it, a "StencilFast" version would be nice in future, which loads neighbourhoods into shared local memory and only does aligned access to global memory (this used to be the required way to write kernels to get decent performance, but may not be needed in modern GPUs which have more elaborate coalescing strategies).

paulmetzger commented 3 years ago

Hi @mn416 , I renamed the application and added a check for whether we are in a simulator or on an FPGA. The Rodinia benchmark that I planned to implement is a 2D stencil kernel and uses local memory, but it requires FP support, as I mentioned.

I am not sure you can get around non-aligned accesses in a stencil kernel, because of the neighbourhood access patterns that are typical of stencils. If you compute NxN blocks, then you access overlapping (N+2)x(N+2) blocks, assuming a neighbourhood size of 1. I just tried the special case (N+2) == alignment_requirement (which is the number of SIMT lanes?) with some drawings and came to the conclusion that some of the accesses would still be unaligned. Are the overheads of capabilities affected by whether memory accesses are aligned or not?

mn416 commented 3 years ago

@paulmetzger Thanks for the tweaks, let's revisit when we've tried it on FPGA.

I think a solution with completely aligned accesses is doable. Suppose we have a single warp of 32 threads computing a single row of the output, moving from left to right. To compute the outputs at i .. i+31 (aligned with the thread indices in the warp), we need the inputs at i-32 .. i-1 (to the left, also aligned), i .. i+31 (middle, aligned), and i+32 .. i+63 (to the right, also aligned). As an improvement, we can avoid repeated loading of the same data by shifting left/middle/right as we go, i.e. left := middle, middle := right, and right := (load next block of 32). Does that make any sense? Multiple warps can then be used to compute multiple rows in parallel. There may be some repeated loading between warps (but not within a warp).
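A sequential C model of the scheme above, with arrays standing in for the per-lane registers of a 32-thread warp. Names and structure are illustrative, not SIMTight API; boundaries are treated as zero for simplicity.

```c
#include <assert.h>
#include <string.h>

#define LANES 32

/* One row of a 3-point (radius-1) stencil, computed 32 outputs at a
   time. Every load from 'in' is a whole aligned block of 32 words;
   the unaligned neighbour accesses happen only between the local
   left/middle/right buffers. Assumes width is a multiple of LANES. */
void stencil_row(const int *in, int *out, int width) {
  int left[LANES], middle[LANES], right[LANES];
  /* Prime the pipeline: the block left of the row start is all zeros
     (boundary); 'middle' holds the first real block. */
  memset(left, 0, sizeof left);
  memcpy(middle, in, sizeof middle);
  for (int i = 0; i < width; i += LANES) {
    if (i + LANES < width)
      memcpy(right, in + i + LANES, sizeof right);  /* aligned load */
    else
      memset(right, 0, sizeof right);               /* boundary */
    for (int lane = 0; lane < LANES; lane++) {
      int l = lane == 0         ? left[LANES - 1] : middle[lane - 1];
      int r = lane == LANES - 1 ? right[0]        : middle[lane + 1];
      out[i + lane] = l + middle[lane] + r;
    }
    /* Shift: left := middle, middle := right. */
    memcpy(left, middle, sizeof left);
    memcpy(middle, right, sizeof middle);
  }
}
```

Each input block is loaded from global memory exactly once per row; the shifting keeps the warp's three blocks rolling along without re-fetching.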

It is true that none of this results in different overheads for CHERI v non-CHERI. The efficient version is just nice to show that the hardware prototype is capable of doing something good!

mn416 commented 3 years ago

To clarify, I mean the left/middle/right are kept in shared local mem and accessed unaligned (which is fine).

paulmetzger commented 3 years ago

@mn416 , yes, you are right. For some reason I only thought of implementations that add boundary values to the input buffers. For example, if you have a 16x16 input for a stencil with a radius of one, these implementations would use an 18x18 input buffer. The first and last rows and columns would contain boundary values that are required to compute elements at the boundary of the 16x16 grid. One could instead add checks to the kernel and execute different code for values at the boundaries. The downside is that these checks add overhead, but it will likely be amortised by the improvements from aligned memory accesses.
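The padded-buffer layout described above, as a small C sketch using the sizes from the example (16x16 interior, radius 1, 18x18 buffer). With the halo in place, the kernel needs no boundary checks at all.

```c
#include <assert.h>

#define N    16            /* logical grid is N x N           */
#define R    1             /* stencil radius                  */
#define PAD  (N + 2 * R)   /* padded buffer is 18 x 18 here   */

/* Index into the padded buffer: (x, y) in 0..N-1 address the
   interior; the halo rows/columns sit at offsets 0 and PAD-1. */
static int idx(int x, int y) { return (y + R) * PAD + (x + R); }

/* 5-point stencil over the interior; no boundary checks needed
   because every neighbour of an interior cell exists in the halo. */
void step(const int *in, int *out) {
  for (int y = 0; y < N; y++)
    for (int x = 0; x < N; x++)
      out[idx(x, y)] = in[idx(x, y)]
                     + in[idx(x - 1, y)] + in[idx(x + 1, y)]
                     + in[idx(x, y - 1)] + in[idx(x, y + 1)];
}
```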

mn416 commented 3 years ago

@paulmetzger I'm going to merge this into a new branch pffm2-stencil-slow, as it sounds like you have a more efficient version now that would be nicer to have in master. I don't want to lose this version though, as it will be a useful test if/when we make improvements to the mem subsystem. BTW, feel free to create your own branches in this repo.

paulmetzger commented 3 years ago

@mn416 OK great!