Open antonysigma opened 3 years ago
Hi @antonysigma. Thanks for your interest.
In a future release, we will support data movement under a specific loop axis, which you can combine with loop tiling/reordering to realize the computation you described. To be more concrete, please see the following code example:
def one_stage(A):
    B = hcl.compute(A.shape, lambda x, y: A[x, y] + A[x + 1, y]
                    + A[x, y + 1] + A[x + 1, y + 1], "B")
    return B
s = hcl.create_schedule([A], one_stage)
# Define a mock-up target
target = hcl.Platform.zcu102
target.config(compiler="vitis", backend="vhls")
# Split the image into tiles of size 2x2
s_B = one_stage.B
yo, yi, xo, xi = s[s_B].tile(axis=[0,1], factor=[2,2])
s[s_B].reorder([yo, xo, yi, xi])
# Move input from host to FPGA accelerator and
# store the input (tile) under loop axis yi inside a local on-chip buffer
s.to(A, target.xcel).to(s_B, axis=yi)
# Move the output from FPGA to host when the convolution on input tile is done
s.to(s_B, target.host, axis=yi)
In other words, the substages for producing and consuming image tiles would be inferred automatically by the HCL compiler, based on the information provided by the .to() primitive. Right now the master branch of HCL only provides preliminary support for .to(), which moves the entire tensor between host and accelerator, but we will release a new version of HCL very soon to support this feature. Stay tuned!
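To make the tiling step above easier to picture, here is a minimal plain-Python sketch (not HeteroCL code) of the loop nest that a 2x2 tile followed by a reorder to (yo, xo, yi, xi) corresponds to. The function name `tiled_order` and the mapping of yo/yi to rows and xo/xi to columns are assumptions made for illustration; the comment about where a per-tile buffer might be filled is a guess at how an axis-level .to() could lower, not a statement about the compiler.

```python
def tiled_order(height, width, tile_h=2, tile_w=2):
    """Return (row, col) points in the order a tiled, reordered loop nest visits them.

    Assumes height and width are multiples of the tile size.
    """
    order = []
    for yo in range(height // tile_h):        # outer loop over tile rows
        for xo in range(width // tile_w):     # outer loop over tile columns
            # Assumption: a per-tile local buffer could be filled at this point,
            # before the inner loops touch the tile.
            for yi in range(tile_h):          # inner loops stay inside one tile
                for xi in range(tile_w):
                    order.append((yo * tile_h + yi, xo * tile_w + xi))
    return order


order = tiled_order(4, 4)
```

Every point of the 4x4 iteration space is still visited exactly once; only the order changes, so all four points of one tile are handled before the next tile starts.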
Thank you @Hecmay for the prompt reply! I look forward to customizing data movement by loop axis.
It is also very helpful to see example code at this stage. Once the new feature is delivered on GitHub, I will be curious about how the order of the following calls influences the data transfer mechanism.
s.to(s_B, axis=yi).to(s_B, target.host, axis=yi)
s.to(s_B, target.host, axis=yi).to(s_B, axis=yi)
Hi HeteroCL developers,
I came across a similar tutorial in the Halide-HLS project, in which they customized the 2D convolution algorithm by (1) splitting the large image into tiles on the host (Zynq ARM64), and then (2) sending the tiles to the accelerator (Zynq FPGA) to run the convolution steps. The processed tiles are sent back to the host for tile stitching. Reference: https://github.com/jingpu/Halide-HLS/blob/905d2f2ad560246673ba3a84b8a6d8be308e481f/apps/hls_examples/gaussian_hls/pipeline.cpp#L103-L107
I wonder how we can describe such a customization with the HeteroCL scheduling syntax, without rewriting the algorithm?
In other words, how do I "split" the stage B into sub-stages tile_producer and tile_consumer, like the following pseudo-code? Or, should I explicitly describe the sub-stages in order to utilize the hcl.to() syntax?
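For illustration only (this is not the pseudo-code from the post, and it does not use HeteroCL), the split / process / stitch flow described above can be sketched in plain Python. The names `tile_producer` and `tile_consumer` come from the question; `stencil` and the choice to compute only the valid (in-bounds) output region are assumptions made here. Each tile carries one extra row and column of halo, since the 2x2 stencil reads the neighbors at x + 1 and y + 1.

```python
def stencil(img):
    """Reference: sum of each 2x2 neighborhood, valid region only."""
    h, w = len(img), len(img[0])
    return [[img[y][x] + img[y + 1][x] + img[y][x + 1] + img[y + 1][x + 1]
             for x in range(w - 1)] for y in range(h - 1)]


def tile_producer(img, tile=2):
    """Host side: yield (row, col, tile_with_halo) for each output tile."""
    out_h, out_w = len(img) - 1, len(img[0]) - 1
    for y0 in range(0, out_h, tile):
        for x0 in range(0, out_w, tile):
            th = min(tile, out_h - y0)   # edge tiles may be smaller
            tw = min(tile, out_w - x0)
            # One extra row and column (halo) so the stencil stays in bounds.
            block = [row[x0:x0 + tw + 1] for row in img[y0:y0 + th + 1]]
            yield y0, x0, block


def tile_consumer(img, tile=2):
    """Accelerator stand-in: run the stencil per tile, then stitch the results."""
    out = [[0] * (len(img[0]) - 1) for _ in range(len(img) - 1)]
    for y0, x0, block in tile_producer(img, tile):
        result = stencil(block)          # the work that would run on the FPGA
        for dy, row in enumerate(result):
            out[y0 + dy][x0:x0 + len(row)] = row
    return out


image = [[y * 5 + x for x in range(5)] for y in range(5)]
tiled = tile_consumer(image)
```

The stitched result matches running `stencil` on the whole image, which is the property a compiler-inferred producer/consumer split would also need to preserve.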