bu-icsg / dana

Dynamically Allocated Neural Network Accelerator for the RISC-V Rocket Microprocessor in Chisel

Use UTL to load/store NN configurations #17

Closed seldridge closed 7 years ago

seldridge commented 8 years ago

The current approach of loading an NN configuration through Rocket's L1D$ (i.e., through the RoCC interface's mem port) is slow and likely overwrites a large portion of the data in the L1D$. This type of load/store is better treated as an uncached access that goes directly to L2, which can be accomplished using one of the AUTL/UTL ports.

seldridge commented 8 years ago

Since the UTL interface talks directly to TileLink, and therefore to L2, it needs to use physical addresses. As I understand it, this can be accomplished in one of two ways:

The former approach seems to be the way to go, for the following reasons. First, I think that address translation through Rocket's TLB ports happens within the context of the current Rocket process. X-FILES/DANA transactions, however, are not necessarily synchronous with Rocket's context, e.g., a long-running learning transaction from a previous process may need to write back learned weights to memory. Relying on the TLB port for address translation therefore seems like the wrong way to go. Second, a neural network configuration will likely extend beyond a single page. The accelerator would then have to track where it is within a page and do additional page table walks to reach the next page, which adds overhead. Avoiding this (with contiguous pages) necessarily involves getting the kernel to explicitly manage the ASID--NNID table, so we might as well just have it set up physical addresses.
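To make the page-boundary point concrete, here is a minimal C sketch. It is not code from this repository. The 4 KiB page size and the helper names `pages_spanned` and `bytes_to_page_end` are assumptions chosen for illustration. It shows how many pages a virtually addressed configuration touches, and how much can be transferred before the next page boundary forces another translation.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The real page size depends on the platform. */
#define PAGE_SIZE 4096u

/* Number of pages touched by the byte range [vaddr, vaddr + size).
 * Unless the pages are physically contiguous, each page after the
 * first needs its own translation. */
size_t pages_spanned(uint64_t vaddr, size_t size)
{
    if (size == 0)
        return 0;
    uint64_t first = vaddr / PAGE_SIZE;
    uint64_t last = (vaddr + size - 1) / PAGE_SIZE;
    return (size_t)(last - first + 1);
}

/* Bytes that can be transferred starting at vaddr before reaching the
 * end of the current page, capped at the number of bytes remaining. */
size_t bytes_to_page_end(uint64_t vaddr, size_t remaining)
{
    size_t room = PAGE_SIZE - (size_t)(vaddr % PAGE_SIZE);
    return remaining < room ? remaining : room;
}
```

Under these assumptions, a 10000-byte configuration starting at offset 100 touches three pages, so a translating accelerator would need up to three lookups where a physically contiguous buffer needs only a single base address.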

From what @handong32 has stated, we can use the following functions to set up and implement the former approach:

seldridge commented 8 years ago

@handong32 -- This is likely what's blocking you. Physical reads to the L1 D$ should barf. I did some initial work towards this over a month back (0c62436cae4c8eede37f003623bbc1b93465cbbe), but got stuck as I didn't have any physical memory to test it with. (I tried to add a physical memory ANT into the Proxy Kernel but stopped as I was just replicating your work).

This should be pretty straightforward to get working as Berkeley has an example for using the L2 UTL interface here: https://github.com/ucb-bar/rocket/blob/master/src/main/scala/rocc.scala#L185. Testing is a problem, however, as the C++ model takes ages to boot Linux.