This is a great initiative. Please let me know if you need any help understanding the VTA design elements. Down the road we may want to merge sources and have vendor-specific interfaces and pragmas if we rely on HLS. This will ensure that our source remains consistent across different vendors. A DE10-Nano reference design is the right start. We may purchase one on our end to validate your design.
@tmoreau89 I'm wondering why your current design is clocked at 100MHz yet targets a 7ns execution cycle, when the cycle time should be a multiple of 10ns.
@liangfu good question. The 7ns target is passed to HLS to insert pipeline registers in the synthesized module in order to close timing. It turns out that if you set the target period to 10ns, the design won't close timing at 100MHz. Setting a more aggressive target in HLS will insert more registers and therefore allow us to close timing at 100MHz more easily. It's a bit of a hack, but it works. I'm working on changes to the design to ensure improved timing closure. For the Intel back end, I would say that this parameter wouldn't matter if the Intel HLS compiler and P&R flow did a better job at closing timing at the specified target.
@tmoreau89 I'm quite curious about how fast VTA can run on top of PYNQ. Can you tell me the exact timing for performing inference on the quantized ResNet-18 with the designed instructions? How much time in total is consumed by the GEMM instructions?
@liangfu when we released VTA, our ResNet-18 inference time was around 0.4 seconds. With software and hardware optimizations we can bring inference closer to 0.2 seconds (I'll be releasing an update in the next 10 days or so). Note that this is for a 100MHz design with 256 FMAs, and that we are mostly just offloading GEMM to the FPGA. You can obtain the time breakdown for each layer individually by running the unit tests on a Pynq FPGA. Do you have one handy?
@liangfu Actually, I too have a plan to port VTA to Intel FPGAs, but since you opened this issue, I've been waiting to see how it develops. Shall I help you with something?
@tmoreau89 Thank you for your answer. I have no PYNQ board at hand, but I do have many kinds of Intel FPGA boards, including the DE10-Nano. Therefore, I'm strongly motivated to enable VTA on Intel FPGAs. On the other hand, aside from FMA instructions, I think VNNI would be more effective for NN inference.
@ktabata You're always welcome to contribute. I will submit an initial PR later on, so that we can discuss and cooperate on this topic.
Please refer to PR #1694.
@ktabata @liangfu It sounds like there is a willingness to bring VTA support to Intel FPGAs. It makes sense to combine efforts to facilitate this process as much as possible.
In terms of tasks needed to support a new FPGA platform, here are the high-level bits:
This should be enough to run a TVM benchmark that targets an Intel FPGA. Essentially we could subdivide this into (1) Altera HLS port, (2) IP integration with Avalon, and (3) software driver stack.
Are there tasks that you consider to be already mostly completed on your end @liangfu ? I will be ordering a de10-nano to reproduce and validate your results.
It would be great if things could be broken down into a checklist of small items, so @liangfu, @ktabata, and @tmoreau89 can collaborate on various aspects (either code review or committing changes). Feel free to coordinate on this issue.
@tmoreau89 @tqchen Good point to split the task into workable subtasks.
@tmoreau89 I've been mostly working on migrating the IP cores from Xilinx HLS to Intel HLS (I think an OpenCL-based implementation would be less efficient and incompatible with VTA), and I think I can work on the testing and simulation of those functions as well. However, as I got my hands on the driver part, I found `mmap`-based device communication unfamiliar to me at the moment. Can anyone take on the driver migration and system integration task?
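For reference, `mmap`-based device communication in this context usually amounts to mapping the accelerator's physical register space into userspace and accessing it as volatile words. The sketch below is illustrative only: the base address, register offsets, and their meanings are placeholders, not the actual VTA driver layout.

```cpp
// Illustrative sketch of mmap-based register access (placeholder addresses,
// not the actual VTA driver): map a module's control registers from /dev/mem
// and access them as volatile 32-bit words.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  const off_t kRegBase = 0x43C00000;  // hypothetical physical base address
  const size_t kRegSpan = 0x1000;     // hypothetical register span

  int fd = open("/dev/mem", O_RDWR | O_SYNC);
  if (fd < 0) { perror("open /dev/mem"); return 1; }

  void *regs = mmap(nullptr, kRegSpan, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, kRegBase);
  if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

  volatile uint32_t *r = static_cast<volatile uint32_t *>(regs);
  r[1] = 64;               // hypothetical "instruction count" register
  uint32_t status = r[0];  // hypothetical control/status register
  printf("status = 0x%08x\n", status);

  munmap(regs, kRegSpan);
  close(fd);
  return 0;
}
```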
@ktabata would you be interested in bringing driver support for Intel FPGAs? I am not very familiar with the software stack on Intel FPGAs, but I'd be happy to provide some guidance.
I have some experience with AOCL, but I'm new to the Intel HLS Compiler. Currently I'm figuring out which is more suitable for VTA (AOCL or HLS). I'm curious to know whether host (CPU) SDRAM access from HLS code is possible. @liangfu Do you have information about that?
@ktabata - we could attempt to port VTA to AOCL, but we may not be able to implement latency hiding under this programming model. Having separate IP blocks that operate independently from one another via dependence queues was difficult to achieve with SDAccel: the programming model did not allow us to express "continually executing" modules, which is why we went with the HLS + IPI (Tcl scripting) flow.
The takeaway here is that we could use AOCL to get a design up and running on Intel FPGAs, but if we want a high-efficiency design with latency hiding, we may have to stick to HLS for now.
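To make the latency-hiding point concrete, here is a plain C++ stand-in (not vendor HLS code) in which a load stage and a compute stage synchronize only through dependence-token FIFOs, with the free-buffer queue pre-filled so the load stage can run one tile ahead. In a real HLS flow the queues would be hardware FIFOs (e.g. `hls::stream` or `ihc::stream`) and each stage a continuously executing module.

```cpp
// Software stand-in for dependence-queue synchronization between a load and
// a compute stage; in hardware both stages run concurrently and block on the
// FIFOs, which is what hides the DDR load latency behind compute.
#include <cstdio>
#include <queue>

struct Token {};  // a dependence token carries no data, only ordering

int main() {
  const int kTiles = 4;
  std::queue<Token> load_done;    // load -> compute: "data ready"
  std::queue<Token> buffer_free;  // compute -> load: "buffer reusable"

  // Two free buffers: double buffering lets load run one tile ahead.
  buffer_free.push(Token{});
  buffer_free.push(Token{});

  int next_load = 0, next_compute = 0;
  while (next_compute < kTiles) {
    // Load stage: issue as long as a free-buffer token is available.
    while (next_load < kTiles && !buffer_free.empty()) {
      buffer_free.pop();
      printf("load    tile %d\n", next_load++);
      load_done.push(Token{});
    }
    // Compute stage: consume one loaded tile, then release its buffer.
    if (!load_done.empty()) {
      load_done.pop();
      printf("compute tile %d\n", next_compute++);
      buffer_free.push(Token{});
    }
  }
  return 0;
}
```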
@ktabata Actually, the Intel HLS Compiler is new to me as well. What draws my interest is the high-efficiency design it can generate; in particular, it offers the possibility of implementing dependence queues and pipelined parallel execution.
AFAIK, if we need to access SDRAM from the FPGA side, we only need to add an SDRAM Controller in Qsys. Please refer to the GHRD for the DE10-Nano.
@liangfu - by SDRAM are you referring to accessing DDR? Indeed, I am less familiar with how to connect the DMA master to a memory controller in Qsys. This depends on the SoC organization: on Zynq we connect our DMA AXI master to the ACP port of the ARM SoC, which enables coherent access to the SoC's main memory.
It would be good to maintain the same coherent interface if possible with the Intel SoC as we do for the Xilinx SoC (this will make it easier to keep the TVM runtime consistent between the manufacturers).
@tmoreau89 I completely agree on having a coherent design between the Intel SoC and the Xilinx SoC, and I would utilize the AXI master bridge to support VTA. I was just answering @ktabata's question on whether Intel HLS can access SDRAM.
@tmoreau89 As I'm implementing the driver for the modules, I'm not quite clear on whether the `fetch` module loads instruction streams from DDR via the cache-coherent ACP port, or whether it could load instructions through the lightweight AXI port. In addition, if it loads instructions via the ACP port, does it require any additional on-chip RAM to store the instruction queue?
@liangfu the `fetch` module grabs the instructions from DDR via DMA over the ACP port. It pushes the instructions it streams in into FIFO queues, each of which takes one block RAM.
If you have Vivado installed on your machine, you can build the design and visualize the block diagram by opening the .xpr project file in the Vivado GUI and then opening the vta_wrapper.bd block design. This will show you how all of the modules are integrated within the Zynq SoC chip.
Hope this helps!
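For anyone following along, a much-simplified fetch-style stage in Xilinx HLS C++ looks roughly like the sketch below. This is illustrative rather than the actual VTA source: the 128-bit instruction width and the opcode field used for dispatch are assumptions. The stage reads the instruction array from DDR through an `m_axi` master and pushes each instruction into the FIFO of the module that should execute it.

```cpp
// Simplified fetch-style stage (illustrative, not the VTA source).
// Assumptions: 128-bit instructions, low two bits select the target module.
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<128> insn_T;

void fetch(unsigned insn_count,
           const insn_T *insns,                  // instruction stream in DDR
           hls::stream<insn_T> &load_queue,
           hls::stream<insn_T> &gemm_queue,
           hls::stream<insn_T> &store_queue) {
#pragma HLS INTERFACE m_axi port = insns offset = slave
#pragma HLS INTERFACE s_axilite port = insn_count
#pragma HLS INTERFACE s_axilite port = return
#pragma HLS INTERFACE axis port = load_queue
#pragma HLS INTERFACE axis port = gemm_queue
#pragma HLS INTERFACE axis port = store_queue
  for (unsigned i = 0; i < insn_count; ++i) {
    insn_T insn = insns[i];              // DMA read over the AXI master
    ap_uint<2> opcode = insn(1, 0);      // assumed opcode field
    if (opcode == 0)      load_queue.write(insn);
    else if (opcode == 1) gemm_queue.write(insn);
    else                  store_queue.write(insn);
  }
}
```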
@tmoreau89 Thank you for your help.
Now I'm aware that the `fetch` module uses DMA to grab the instruction streams from DDR, and I have implemented that on the de10-nano. However, the Xilinx HLS code defines the interface as `m_axi`, which should be an AXI master interface. As Intel HLS compiles pointers in the argument list into a memory-mapped interface, do you think I should transform that into `ihc::stream_in` and feed the stream with an mSGDMA?
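To make the two options concrete, the sketch below contrasts (a) keeping a memory-mapped Avalon master, so the component dereferences DDR itself, with (b) taking an `ihc::stream_in` that an mSGDMA would feed. The types and template parameters here (`insn_t`, `aspace`/`awidth`/`dwidth`) are assumptions based on the Intel HLS Compiler reference, not tested code.

```cpp
// Rough sketch of the two interface styles under discussion (assumed types
// and templates, untested): Avalon-MM master vs. stream input fed by mSGDMA.
#include "HLS/hls.h"

using insn_t = unsigned long long;  // placeholder instruction type

// Option A: pointer-like argument exposed as an Avalon-MM master, so the
// component itself reads the instructions from DDR.
component void fetch_mm(ihc::mm_master<insn_t, ihc::aspace<1>,
                                       ihc::awidth<32>, ihc::dwidth<64> > &insns,
                        int insn_count) {
  for (int i = 0; i < insn_count; ++i) {
    insn_t insn = insns[i];
    (void)insn;  // decode/dispatch would happen here
  }
}

// Option B: streaming input; an mSGDMA configured in Platform Designer (Qsys)
// would copy the instruction buffer from DDR into this stream.
component void fetch_stream(ihc::stream_in<insn_t> &insns, int insn_count) {
  for (int i = 0; i < insn_count; ++i) {
    insn_t insn = insns.read();
    (void)insn;
  }
}
```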
@tmoreau89 As we are making progress enabling VTA on Intel FPGAs, I have found it difficult to debug the generated hardware of the `compute` component, and there are several features not yet supported in Intel HLS. Therefore, I'm wondering whether there is any plan to decompose the `compute` component into several smaller components. At the least, it could reasonably be constructed from decoders, ALUs, and GEMM modules.
@liangfu what are the features that aren't supported in Intel HLS that are making the process challenging? We could consider breaking modules into smaller components to make testing and debugging easier. What did you have in mind?
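As a concrete starting point for that discussion, a decomposition along those lines might look roughly like the sketch below (illustrative only; the instruction width, the opcode bit, and all names are assumptions): a small decoder forwards each micro-instruction over internal FIFOs to dedicated ALU and GEMM sub-components, so each piece can be unit-tested in isolation.

```cpp
// Illustrative decomposition of compute (all names/widths/fields assumed):
// decoder -> {alu, gemm}, connected by internal FIFOs in a dataflow region.
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<128> insn_T;

static void decode(hls::stream<insn_T> &gemm_queue,
                   hls::stream<insn_T> &alu_ops,
                   hls::stream<insn_T> &gemm_ops) {
  insn_T insn = gemm_queue.read();
  if (insn[2])          // assumed bit distinguishing ALU from GEMM micro-ops
    alu_ops.write(insn);
  else
    gemm_ops.write(insn);
}

static void alu(hls::stream<insn_T> &alu_ops) {
  insn_T insn;
  if (alu_ops.read_nb(insn)) {
    // element-wise min/max/add/shift over the register file would go here
  }
}

static void gemm(hls::stream<insn_T> &gemm_ops) {
  insn_T insn;
  if (gemm_ops.read_nb(insn)) {
    // blocked matrix multiply over the input/weight buffers would go here
  }
}

void compute(hls::stream<insn_T> &gemm_queue) {
#pragma HLS DATAFLOW
  hls::stream<insn_T> alu_ops("alu_ops");
  hls::stream<insn_T> gemm_ops("gemm_ops");
#pragma HLS STREAM variable = alu_ops depth = 4
#pragma HLS STREAM variable = gemm_ops depth = 4
  decode(gemm_queue, alu_ops, gemm_ops);
  alu(alu_ops);
  gemm(gemm_ops);
}
```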
Closed in favor of the most recent Chisel RFC.
Objective

Interface Changes

A new directory `tvm/vta/hardware/intel_fpga` would be created.

Actionable Items

- `fetch`, `load`, `compute`, `store` functions.
- `load` instruction
- `gemm` instruction
- `alu` instructions
- `store` instruction
- driver (see `tvm/vta/src/pynq`)

We welcome further discussions on this topic.