Open pkdeep opened 7 months ago
Hello,
I have the same error in the same example with a custom network #936
I was also trying my own custom network (a fully connected network with 2-bit weights and 2-bit activations) and got this error, so I tried the example notebook. I got the same error with the example net as well.
> On Sun, 10 Dec 2023, 19:05 Batuhan Akkus wrote:
> Hello, I have the same error in the same example with a custom network #936 (https://github.com/Xilinx/finn/issues/936)
Could you please share more details on your model, i.e., the initial onnx graph right after export and the one right before the failing transformation? Might indeed be the same issue as #936, but without seeing the graph it is not possible to tell.
Model.txt
Hi @iksnagreb ,
Please find attached images of the models: one just after export to ONNX, and one just before the failing transformation (`model.transform(CreateDataflowPartition())`).
The first image (top) shows the model just after export to ONNX; the second shows the model just before the failing transformation. The model details are attached in the .txt file. I used the code below for model export (the input to the model is float values between 0 and 1 with shape (1x1024)):
```python
import torch
import onnx
import brevitas.onnx as bo

t1 = tr.model.eval()            # tr is my training wrapper holding the model
a = torch.rand(1, 1, 1, 1024)   # dummy input with the expected shape
bo.export_finn_onnx(t1, a, "qnn_w2_a2_self.onnx")
```
Let me know if you require anything else.
Hm, for some reason your MatMul layers are not converted to the corresponding HLS layer `MatrixVectorActivation`, while in turn the related MultiThreshold layers are converted to standalone HLS layers `Thresholding_Batch`. The `CreateDataflowPartition` transformation, however, expects a continuous chain of purely HLS layers, while your model now has an alternating chain of HLS (the `Thresholding_Batch`) and standard ONNX (i.e., the MatMul) layers.
You are using the bnn-pynq notebooks as is, just loading the w2a2 model at the start? You changed nothing else? Then the problem is very likely that this example notebook is intended for binarized (or bipolar) neural networks. By loading the 2-bit variant, this is no longer the case for you. Thus the `ConvertBipolarMatMulToXnorPopcount` and consequently the `InferBinaryMatrixVectorActivation` transformations (in the two cells before the failing one) will not work, leaving the MatMul layers there.
You have two options now: either stick to the binarized model to follow the example notebook as it is, or adapt the "Conversion to HLS layers" cells so that they work with the 2-bit (or even more-bit) models. For the second option, I suggest you have a look at the `InferQuantizedMatrixVectorActivation` transformation to replace the `InferBinaryMatrixVectorActivation`. You might have to adapt some parts of the streamlining and pre-/post-processing as well.
> You are using the bnn-pynq notebooks as is, just loading the w2a2 model at the start? You changed nothing else?

Yes.
I need to use w2a2 or w2a4 (my intended final network does not give good accuracy below this), so I have to use either the w2a2 or w2a4 configuration.
How do I call or use `InferBinaryMatrixVectorActivation`? Any references?
(P.S.: When I was using the w1a1 configuration, I was able to run the code successfully.)
Thanks
Look into the first code cell of the "Conversion to HLS layers" section in the notebook. The third line should be: `model = model.transform(to_hls.InferBinaryMatrixVectorActivation("decoupled"))`. Start by replacing this call with `InferQuantizedMatrixVectorActivation` and see whether the rest of the notebook works again. It might not be the only change necessary, but I expect it brings you at least through the dataflow partition.
I made the changes you suggested and could move forward, thanks a lot for the help. The attached file shows the model. I'll get back to you with a detailed report tomorrow.
Thanks once again.
@iksnagreb Hi, I have been able to complete the run and generate the PYNQ driver. I also ran the verification notebook, where my results match. Thanks for your support. However, when I run on the board, my results do not match. Any hint as to why that might be happening? Also, my running_weight folder is empty; is that expected behavior? Any help will be greatly appreciated.
Nice to hear you are making some progress. Yes, empty runtime weights should be expected in this case.
Regarding the mismatch when running on the board: I am just guessing, but likely the dataflow partition moved the input quantization (and maybe the output de-quantization, if you are expecting floating-point outputs) out of the hardware design, such that what is running on the device is purely integer, thus you are getting only integers back. That means, you probably have to quantize your inputs manually (and maybe de-quantize your outputs manually as well, or, alternatively compare against quantized expected outputs for verification).
Thanks for the reply. Can you provide any example code or links that would be useful for debugging this issue? I do not know how to locate which design parts end up in HW (PL) versus SW (PS); any calls that give insight into this would be very handy.

> That means, you probably have to quantize your inputs manually (and maybe de-quantize your outputs manually as well, or, alternatively compare against quantized expected outputs for verification).

How do I retrieve the information required to do this manually (like mean and dynamic range)?
To see how FINN partitioned your model, you can have a look into the parent model after creating the dataflow partition: the original notebook saves this as `build_dir + "/tfc_w1_a1_dataflow_parent.onnx"` (you might have changed this). This model graph should look rather simple, containing a `StreamingDataflowPartition` in the center (that is the part viewed in the next cell of the example notebook). Everything inside this `StreamingDataflowPartition` will be placed into the hardware design; everything outside of it will not. What you have shown above is probably just the part inside the partition.
Guessing from the last model graph you provided, it seems that everything up to (and including) the first MultiThreshold is not included in the hardware; this makes sense, as it corresponds to the input quantization. Normally, you would now have to figure out which Quant node (probably just the first one as well) originally corresponds to this MultiThreshold and use the quantization parameters (scale, zero-point, etc.) from there. However, it looks like you have some Mul and Add nodes (we do not care about the Reshape here) preceding the MultiThreshold, which look suspiciously like a conversion to bipolar inputs. Bipolar inputs do not really make sense for your w2a2 model. Is this still the case? You might want to check again whether your inputs and outputs (the Mul and Add following the last MatMul look suspiciously like the reverse of the bipolar conversion) are treated correctly, or whether there are still some leftovers from the binary/bipolar example in there.
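Once the scale, zero-point, and bit width have been read from the relevant Quant node, the manual pre-/post-processing is plain affine arithmetic. A minimal sketch (not FINN API; the parameter values shown are assumed for illustration):

```python
# Sketch: quantize float inputs for the integer-only accelerator and
# de-quantize its raw integer outputs. `scale`, `zero_point` and `bitwidth`
# are assumed to come from the Quant node identified in the parent graph.

def quantize(x, scale, zero_point, bitwidth, signed=False):
    """Map a float value to the integer domain expected by the accelerator."""
    lo = -(2 ** (bitwidth - 1)) if signed else 0
    hi = (2 ** (bitwidth - 1)) - 1 if signed else (2 ** bitwidth) - 1
    q = round(x / scale) + zero_point
    return max(lo, min(hi, q))  # clip to the representable range

def dequantize(q, scale, zero_point):
    """Map a raw integer accelerator output back to float."""
    return (q - zero_point) * scale

# Example with an assumed 2-bit unsigned quantizer, scale=1/3, zero_point=0:
q = quantize(0.7, 1 / 3, 0, 2)   # round(2.1) = 2
x = dequantize(q, 1 / 3, 0)      # back to ~0.667
```

Alternatively, instead of de-quantizing the hardware outputs, the expected outputs can be quantized with the same function and compared in the integer domain.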
Hi @iksnagreb, finally the code is working fine on the board. I was able to push the input Quant layer inside the dataflow partition and to convert the INT output of the dataflow partition to FLOAT using the quantization values. Thanks a lot for your help.
Now, moving forward, I want to play around with the folding factors for performance enhancement. I have a few queries regarding this; if you can answer them or direct me to the relevant section, it would be a great help:
1) When I try to increase the PE and SIMD values, I get an error and the flow does not proceed. Generally the errors are not detailed and it is difficult to make sense of them. Is there a better way to reproduce or debug these? (Current message during synthesis; log file attached: runme.log)
Finished Part Resource Summary
---------------------------------------------------------------------------------
/opt/Xilinx/Vivado/2022.2/bin/rdiArgs.sh: line 312: 49498 Killed "$RDI_PROG" "$@"
Parent process (pid 49498) has died. This helper process will now exit
2) How can I get latency numbers for the implementation?
3) I sometimes get the following error when I try to use "cybersecurity/3-build-accelerator-with-finn.ipynb", which I am not able to understand:
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_common.h:520:35)
INFO: [HLS 207-4518] in instantiation of template class 'ssdm_int<16384, false>' requested here (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_int_base.h:108:29)
INFO: [HLS 207-4518] in instantiation of template class 'ap_int_base<16384, false>' requested here (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_int.h:181:18)
INFO: [HLS 207-4518] in instantiation of template class 'ap_uint<16384>' requested here (/tmp/finn_dev_pradeep/code_gen_ipgen_MatrixVectorActivation_1_ra_wykn9/top_MatrixVectorActivation_1.cpp:38:1)
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_common.h:521:29)
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_common.h:523:104)
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/opt/Xilinx/Vitis_HLS/2022.2/common/technology/autopilot/ap_int.h:212:114)
ERROR: [HLS 207-3337] type 'ap_uint<256U * 32U * ap_int<2>::width>' does not provide a call operator (/home/pradeep/Desktop/finn-brevitas/finn/deps/finn-hlslib/mvau.hpp:268:20)
Thanks
Hi,
1) I'm afraid this error usually points to Vivado running out of RAM during synthesis, most likely due to the larger/more complex design that results from the parallelism increase.

2) You can run this transformation to get the estimated latency per layer. The worst latency will determine the overall throughput, but estimating the actual inference latency is not as easy. For an upper bound, this transformation simply adds all latencies together to give you "critical_path_cycles". RTL simulation is usually preferred to get realistic latency figures. If you are using the FINN builder tool, the following step wraps these analysis transformations and dumps the results into .json log files, along with additional information such as operator/parameter counts: https://github.com/Xilinx/finn/blob/d980f7cb6aadb1ee7915576f73ea28fea2f22021/src/finn/builder/build_dataflow_steps.py#L440

3) FINN uses AXI-Streams internally to move data around. Parallelism directly impacts the width of these streams, and 8192 is the maximum width supported by Vitis HLS. Unfortunately this is a hard limit, so you will have to decrease parallelism and/or model size to make it work. If this only happens in a layer for which an alternative RTL backend exists (such as the ConvolutionInputGenerator), you might be able to avoid this limitation by switching from the HLS to the RTL backend.
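The stream-width limit can be sanity-checked before launching synthesis. A rough sketch, under the assumption that the MVAU weight stream width is `PE * SIMD * weight_bits` (which matches the `256U * 32U * ap_int<2>::width` = 16384 bits in the error log above; the exact formula depends on the layer type):

```python
# Rough pre-synthesis check against the Vitis HLS ap_uint width limit.
# From the log: "'bitwidth' attribute requires integer constant between
# 1 and 8191 inclusive". The width formula below is an assumption for
# MVAU-style weight streams, not an exact FINN computation.

HLS_MAX_STREAM_WIDTH = 8191

def mvau_weight_stream_width(pe, simd, weight_bits):
    """Assumed weight stream width of a MatrixVectorActivation layer."""
    return pe * simd * weight_bits

def folding_fits(pe, simd, weight_bits):
    """True if the chosen folding stays within the HLS width limit."""
    return mvau_weight_stream_width(pe, simd, weight_bits) <= HLS_MAX_STREAM_WIDTH

print(folding_fits(32, 256, 2))  # 16384 bits, as in the failing log -> too wide
print(folding_fits(32, 64, 2))   # 4096 bits -> fits
```

Iterating over candidate (PE, SIMD) pairs with such a check can rule out folding configurations that would fail late in HLS synthesis anyway.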
Hi, thanks a lot for the replies and help. I am able to move forward and experiment with different folding options for better performance. Here are a few doubts which I have:
are attached herewith. FINN creates dangling nodes for the other inputs. Any idea how to address this? Thanks once again for your valuable help. The model definition looks like this:
```python
import torch
import brevitas.nn as qnn

# (excerpt from __init__)
self.qid1 = qnn.QuantIdentity(bit_width=4, return_quant_tensor=True)
self.qlin1 = qnn.QuantLinear(self.input_size, self.hidden2, bias=True,
                             weight_bit_width=weight_bit_width)
self.act1 = qnn.QuantReLU(bit_width=act_bit_width)
self.qid2 = qnn.QuantIdentity(bit_width=4, return_quant_tensor=True)
self.qlin2 = qnn.QuantLinear(self.hidden1, self.hidden2, bias=True,
                             weight_bit_width=weight_bit_width)  # 256+64

def forward(self, x1, x2, x3, x4, x5):
    x = self.qid1(x1)
    x = self.qlin1(x)
    x = self.act1(x)
    x11 = self.qid2(x2)
    x = torch.cat([x, x11], dim=1)
    x = self.qlin2(x)
```
Here are a few observations of mine that might help someone:
Hi @fpjentzsch, an update to my last comment: instead of using multiple inputs to the model (which seem to be difficult to convert to HLS), I now use a single combined input and slice it to feed the different layers, which seems to be the better option. But when I try to streamline the model, I am still left with a few Mul and MatMul nodes which don't go away. I am attaching an image of the model, which in turn gives a cycle-free graph error when I run the "Dataflow" transformation.
Can you help me?
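The "single combined input" workaround boils down to concatenating all inputs into one flat vector on the host and slicing it back apart inside the forward pass. A minimal sketch of the index bookkeeping, with assumed per-input sizes and shown with plain Python lists (in the real model this would be tensor slicing):

```python
# Sketch (sizes are assumed for illustration): split one flat combined input
# back into the original per-branch inputs by tracking running offsets.

SIZES = [1024, 64, 64, 64, 64]  # hypothetical lengths of x1..x5

def split_combined(combined, sizes=SIZES):
    """Slice one flat input vector into the original per-branch inputs."""
    parts, offset = [], 0
    for n in sizes:
        parts.append(combined[offset:offset + n])
        offset += n
    assert offset == len(combined), "combined input has unexpected length"
    return parts

combined = list(range(sum(SIZES)))  # dummy combined input
x1, x2, x3, x4, x5 = split_combined(combined)
print(len(x1), len(x2))  # 1024 64
```

Since every branch now reads from one graph input, the exporter no longer leaves dangling inputs, at the cost of a fixed concatenation layout that both host code and model must agree on.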
Discussed in https://github.com/Xilinx/finn/discussions/937