[RFC][VTA] Support Intel FPGA in VTA

liangfu commented 6 years ago

Objective

To make VTA available for intel FPGA.
To demonstrate the quantized resnet18 on DE10-nano.

Interface Changes

tvm/vta/hardware/intel_fpga would be created.
Interface design of the instructions would be kept consistent with current implementation.
Avalon ST interface would be used in Intel FPGA for data streaming

Actionable Items

[x] Implement HLS based fetch, load, compute, store functions.
- [x] load instruction
- [x] gemm instruction
- [x] alu instructions
- [x] store instruction
- [x] implement test functions for the IP cores
[ ] Adopt existing PYNQ board demonstration on DE10-nano
- [ ] implement system integration scripts
- [ ] implement a set of driver utilities (adopt tvm/vta/src/pynq)
- [x] contiguous memory allocation and de-allocation
- [x] functions to read/write to memory-mapped registers
- [ ] function to program the bitstream on the FPGA
- [ ] migrate python script for DE10-nano

We welcome further discussions on this topic.

tmoreau89 commented 6 years ago

This is a great initiative. Please let me know if you need any help on understadning VTA design elements. Down the road we may want to merge sources and have vendor specific interfaces and pragmas is we rely on HLS. This will ensure that our source remains consistent across different vendors. A DE10 nano reference design is the right start. We may purchase one on our end to validate your design.

liangfu commented 6 years ago

@tmoreau89 I'm wondering why your current design is clocked at 100MHz, and it still achieves 7ns execution cycle, as it should N times of 10ns.

tmoreau89 commented 6 years ago

@liangfu good question. The 7ns target is passed to HLS to insert pipeline register in the synthesized module in order to close timing. It turns out that if you set the target period to 10ns, the design won’t close timing at 100MHz. Setting a more aggressive target in HLS will insert more registers and therefore allow us to close timing at 100MHz more easily. It’s a bit of a hack but it works. I’m working on changes to the design to ensure improved timing closure. For the intel back end, I would say that this parameter wouldn’t matter if the intel HLS compiler and P&R flow did a better job at closing timing at the specified target.

liangfu commented 6 years ago

@tmoreau89 I'm quite curious about how fast VTA can run on top of PYNQ. Can you tell the exact timing in performing inference on the quantized resnet-18 with the designed instructuions? How much time in total are consumed to perform the gemm instruction?

tmoreau89 commented 6 years ago

@liangfu when we released VTA, our ResNet-15 inference time was around 0.4 seconds. Upon software and hardware optimizations we can bring inference closer to 0.2 seconds (I'll be releasing an update in the next 10 days or so). Note that this is for a 100MHz design with 256 FMAs, and that we are mostly just offloading GEMM on the FPGA. You can obtain the time breakdown for running each layer individually by running the unit tests on a Pynq FPGA. Do you have one handy?

ossdev-somewhere commented 6 years ago

@liangfu Actually, I too have a plan to port VTA to Intel FPGA, but because you opened this issue, I've been waiting and seeing. Shall I help you with something?

liangfu commented 6 years ago

@tmoreau89 Thank you for your answer. I have no PYNQ board at hand, but I do have many kinds of Intel FPGA boards, including de10-nano. Therefore, I have strong willingness to enable VTA on Intel FPGAs. On the either hand, aside from FMA instructions, I think VNNI would be more effective for nn inference.

@ktabata You're always welcome to contribute. I would submit an initial PR later on, so that we can discussion and cooperation on this topic.

Please refer to PR #1694.

tmoreau89 commented 6 years ago

@ktabata @liangfu It sounds like there is a willingness to bring VTA support to Intel FPGAs. It makes sense to combine efforts if possible to facilitate this process as much as possible.

In terms of tasks needed to support a new FPGA platforms here are the high-level bits:

Porting the IP source from Xilinx HLS to Altera HLS/OpenCL with the Altera specific pragmas
Implementing test functions for those IP cores
Implementing a system integration .tcl script that wires those IP cores together to form a complete FPGA design (here we'd use Avalon-based system integration)
Implement a set of driver utilities that allow for
- contiguous memory allocation and de-allocation, with functions to obtain physical address of the allocated buffer (assuming shared memory system)
- functions to read/write to memory-mapped registers for accessing control registers of IP blocks
- function to program the bitstream on the FPGA

This should be enough to run a TVM benchmark that targets an Intel FPGA. Essentially we could subdivide this into (1) Altera HLS port, (2) IP integration with Avalon, and (3) software driver stack.

Are there tasks that you consider to be already mostly completed on your end @liangfu ? I will be ordering a de10-nano to reproduce and validate your results.

tqchen commented 6 years ago

It would be great if things can be broken down into a checklist of small items, so @liangfu and @ktabata @tmoreau89 can collaborate on some of the perspectives(either code review or commit changes). Feel free to coordinate on this issue

liangfu commented 6 years ago

@tmoreau89 @tqchen Good point to split the task into each workable subtasks.

@tmoreau89 I've been mostly working on migrating the IP cores from Xilinx HLS to Intel HLS (I think OpenCL based implementation would be less efficient and incompatible with VTA.) And I think I can work on the testing and simulation of those functions as well. However, as I got my hands on the driver part. I found mmap based device communication unfamiliar to me at the moment. Can anyone take the driver migration and system integration task?

tmoreau89 commented 6 years ago

@ktabata would you be interested in bringing driver support for Intel FPGAs? I am not very familiar with the software stack on Intel FPGAs, but I'd be happy to provide some guidance.

ossdev-somewhere commented 6 years ago

I have some experience in AOCL, but I'm new to Intel HLS Compiler. Currently I'm finding out which is suitable for VTA (AOCL or HLS). I'm curious to know whether or not host(CPU) SDRAM access from HLS code is possible. @liangfu Do you have information about that?

tmoreau89 commented 6 years ago

@ktabata - we could attempt to port VTA to AOCL, but we may not be able to implement latency hiding under this programming model - having separate IP blocks that can operate independently from one another via dependence queues was difficult to achieve with SDAccel - the programming model did not allow us to express "continually executing" modules, which is why we went for HLS + IPI (tcl scripting) flow.

The takeaway here is that we could use AOCL to get a design up and running for Intel FPGAs, but it's likely that if we want to have high efficiency design with latency hiding, we may have to stick to HLS for now.

liangfu commented 6 years ago

@ktabata Actually Intel HLS Compiler is new to me as well. It is the high efficiency design it could generate that draws my interest. The techniques include possibilities to implement dependency queues and pipelined parallel execution.

AFAIK, if we need to access SDRAM from FPGA side, we only need to add SDRAM Controller in Qsys. Please refer to GHRD for DE10-nano.

tmoreau89 commented 6 years ago

@liangfu - by SDRAM are you referring to accessing DDR? Indeed I am less familiar with how to connect the DMA master to a memory controller in Qsys. This depends on the SoC organization - on Zynq we connect our DMA Axi master to the ACP port of the ARM SoC which enables coherent memory access to the SoC's main memory.

It would be good to maintain the same coherent interface if possible with the Intel SoC as we do for the Xilinx SoC (this will make it easier to keep the TVM runtime consistent between the manufacturers).

liangfu commented 6 years ago

@tmoreau89 I completely agree to have coherent design between Intel SoC and Xilinx SoC, and I would utilize AXI master bridge to support VTA. I was just answering to @ktabata 's question on whether it is possible to have Intel HLS access SDRAM.

liangfu commented 6 years ago

@tmoreau89 As I'm implementing the driver for the modules, I'm not quite clear about whether the fetch module loads instruction streams from DDR via cache coherent ACP port, or it could load instructions through the lightweight AXI port. In addition, if it loads instructions via the ACP port, does it require any additional on-chip RAM store the instruction queue?

tmoreau89 commented 6 years ago

@liangfu the fetch module grabs the instructions from DDR via DMA over the ACP port. It pushes the instruction it streams in into FIFO queues, that all take 1 block ram.

If you have vivado installed on your machine, you can build the design, and visualize the board diagram by opening the .xpr project file in the Vivado GUI, and then open the vta_wrapper.bd board design. This will show you how all of the modules are integrated within the Zynq SoC chip.

Hope this helps!

liangfu commented 6 years ago

@tmoreau89 Thank you for your help.

Now, I'm aware that the fetch module uses DMA to grab the instruction streams in DDR, and I have implemented that on de10-nano. However, it is defined in the Xilinx HLS code that the interface is m_axi, which should be a axi master interface. As Intel HLS compiles pointers in the argument list into memory mapped interface, do you think I shall transform that into ihc::stream_in and feed the stream with mSGDMA ?

liangfu commented 5 years ago

@tmoreau89 As we are making progress to enable VTA for Intel FPGA, I found it is difficult to debug the generated hardware of the compute component, and there are several features not yet supported in the Intel HLS. Therefore, I'm wondering whether there is any plan to decompose the compute component into several smaller components. At least, it could be reasonably constructed with decoders, ALUs and GEMM modules.

tmoreau89 commented 5 years ago

@liangfu what are the features that aren't supported in Intel HLS that are making the process challenging? We could consider breaking modules into smaller components to make testing and debugging more easy. What did you have in mind?

tqchen commented 5 years ago

closed in favor of the most recent chisel RFC

apache / tvm