Xilinx / finn

Dataflow compiler for QNN inference on FPGAs
https://xilinx.github.io/finn
BSD 3-Clause "New" or "Revised" License

Obtained different results after calling an IP several times; it seems like data is not handled properly inside. #796

Open KeciaHH opened 1 year ago

KeciaHH commented 1 year ago

Hello, we are facing a strange problem here. What we have done:

What we are currently doing: trying to use the IP on a Pynq-Z2 board. But when we call the IP several times with the same input, it outputs different results. Like this:

(Screenshot 2023-04-08 17:09:17)

As you can see, the inputs to this IP are exactly the same every time. The outputs are similar but differ from call to call; the first output is exactly correct, but the later ones are not. It seems like the IP reuses some internal data from the previous input. However, if we reinitialize the Overlay every time we want to use it, it works properly but is much slower, as shown in the following picture:

(Screenshot 2023-04-08 17:18:40)

These are the ONNX file and the final test notebook we used: ipynb&onnx.zip
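For reference, a minimal sketch of the repeated-call experiment described above, assuming the FINN-generated PYNQ driver (driver.py / driver_base.py with the FINNExampleOverlay.execute() helper); the bitfile name and input file below are placeholders, not taken from the attached notebook:

    import numpy as np
    from driver import io_shape_dict                # generated by FINN
    from driver_base import FINNExampleOverlay      # generated by FINN

    # Load the accelerator once and reuse it (the fast path that shows the bug).
    accel = FINNExampleOverlay(
        bitfile_name="resizer.bit",                 # placeholder bitfile name
        platform="zynq-iodma",
        io_shape_dict=io_shape_dict,
    )

    ibuf = np.load("input_permute.npy")             # identical input for every call
    outputs = [accel.execute(ibuf) for _ in range(5)]

    # Expected: all five outputs identical; observed in this issue: only the first is correct.
    for i, out in enumerate(outputs):
        print(i, np.array_equal(out, outputs[0]))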

auphelia commented 1 year ago

Hi @KokyoK, could you please provide more information?

KeciaHH commented 1 year ago

Hello @auphelia,

Thanks

auphelia commented 1 year ago

Hi @KokyoK, while we look into reproducing your issue on our side, could you please update to PYNQ version 3.0.1 and see if the error persists?

KeciaHH commented 1 year ago

Hi @auphelia, I have tried PYNQ 3.0.1 and the same issue still occurs.

fionnodonohoe-xlnx commented 1 year ago

Hi @KokyoK, I am unable to build the bitstream from the provided ONNX file. Could you please provide the original trained model? Would it be possible to also share the 'input_permute.npy' file that is used by the notebook? Thanks

KeciaHH commented 1 year ago

Hi @fionnodonohoe-xlnx, here are the files: model.py contains the model structure, which is an ordinary convolutional model; weight.pt holds the trained weights loaded by the model; input_permute.npy is also provided.

Since we were able to build a bitstream from the provided ONNX file, I guess there might be something wrong in how we built the bitstream.

model.zip

Thanks for your effort!

fionnodonohoe-xlnx commented 1 year ago

Hi @KokyoK,

I tried creating the ONNX file from model.py. When adding model.save(is_onnx=1) after model.eval() I get the following error: RuntimeError: Given groups=1, weight of size [16, 40, 1, 3], expected input[16, 30, 1, 101] to have 40 channels, but got 30 channels instead

I then changed the expected input to have 40 channels instead of 30 - only to get this error: RuntimeError: input_shape.size() > 0 || reshape.size() > 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/passes/onnx/shape_type_inference.cpp":448, please report a bug to PyTorch. Reshape node should have at least one input size > 0 when constant folding.

Are you also seeing this error? Maybe you could send on your TCResNet8.onnx file created from your script. I can try to put that ONNX file through the bitstream generation stage then.

KeciaHH commented 1 year ago

Hello @fionnodonohoe-xlnx,

  1. The correct way to save the .onnx file is to put the following lines in the main function:

    from model import QuantizedTCResNet8  # assuming model.py from the attached model.zip is importable
    import brevitas.onnx as bo

    # instantiate the model and load the trained weights
    model = QuantizedTCResNet8(1, 40, 10)
    model.load("weight.pt")
    model.eval()

    # export to FINN-ONNX with the expected input shape (batch, channels, height, width)
    export_onnx_path = "8b_weight_act_bias_net.onnx"
    input_shape = (1, 40, 1, 101)
    bo.export_finn_onnx(model, input_shape, export_onnx_path)

    and run the main function. The ONNX file should be the same as the one I uploaded before. Sorry for the previous confusion.

  2. We built the bitstream with this modified build_dataflow_steps.py: basically we added line 327 and commented out line 338. The file is attached (see also the builder sketch after the attachment below).

build_dataflow_steps.py.zip
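For context, a hedged sketch of how a Pynq-Z2 bitfile is usually generated with the FINN builder API (build_dataflow_cfg plus DataflowBuildConfig); the output directory, clock period and file names below are placeholders, and option names may differ between FINN versions:

    import finn.builder.build_dataflow as build
    import finn.builder.build_dataflow_config as build_cfg

    cfg = build_cfg.DataflowBuildConfig(
        output_dir="output_tcresnet8_pynqz2",        # placeholder
        synth_clk_period_ns=10.0,                    # 100 MHz target clock
        board="Pynq-Z2",
        shell_flow_type=build_cfg.ShellFlowType.VIVADO_ZYNQ,
        generate_outputs=[
            build_cfg.DataflowOutputType.BITFILE,
            build_cfg.DataflowOutputType.PYNQ_DRIVER,
            build_cfg.DataflowOutputType.DEPLOYMENT_PACKAGE,
        ],
    )

    # The modified build_dataflow_steps.py attached above changes the default step
    # implementations; the build itself is still driven by a config like this one.
    build.build_dataflow_cfg("8b_weight_act_bias_net.onnx", cfg)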

fionnodonohoe-xlnx commented 1 year ago

Hi @KokyoK, thank you for that. Unfortunately I hit another bitstream generation error. It turns out that the TLastMarker insertion added in the provided code causes a bitstream generation error for me: + model = model.transform(InsertTLastMarker(both=True)) ... as the TLastMarker class has no get_input_datatype() method. Do you have edits elsewhere in your local clone that work around this issue?

Here is what I see on the command line when I use the provided build_dataflow_steps.py:

Running step: step_qonnx_to_finn [1/17]
Running step: step_tidy_up [2/17]
Running step: step_streamline [3/17]
Running step: step_convert_to_hls [4/17]
Running step: step_create_dataflow_partition [5/17]
Running step: step_target_fps_parallelization [6/17]
Running step: step_apply_folding_config [7/17]
Running step: step_generate_estimate_reports [8/17]
Running step: step_hls_codegen [9/17]
Running step: step_hls_ipgen [10/17]
Running step: step_set_fifo_depths [11/17]
Running step: step_create_stitched_ip [12/17]
Running step: step_measure_rtlsim_performance [13/17]
Running step: step_out_of_context_synthesis [14/17]
Running step: step_synthesize_bitfile [15/17]
Traceback (most recent call last):
  File "~/workspace/src/finn/builder/build_dataflow.py", line 168, in build_dataflow_cfg
    model = transform_step(model, cfg)
  File "~/workspace/src/finn/builder/build_dataflow_steps.py", line 772, in step_synthesize_bitfile
    model = model.transform(
  File "~/workspace/deps/qonnx/src/qonnx/core/modelwrapper.py", line 140, in transform
    (transformed_model, model_was_changed) = transformation.apply(transformed_model)
  File "~/workspace/src/finn/transformation/fpgadataflow/make_zynq_proj.py", line 350, in apply
    kernel_model = kernel_model.transform(InsertFIFO())
  File "~/workspace/deps/qonnx/src/qonnx/core/modelwrapper.py", line 140, in transform
    (transformed_model, model_was_changed) = transformation.apply(transformed_model)
  File "~/workspace/src/finn/transformation/fpgadataflow/insert_fifo.py", line 199, in apply
    dtype = n0.get_input_datatype(inp_ind)
  File "~/workspace/src/finn/custom_op/fpgadataflow/hlscustomop.py", line 711, in get_input_datatype
    raise Exception("get_input_datatype not implemented for this op")
Exception: get_input_datatype not implemented for this op
> ~/workspace/src/finn/custom_op/fpgadataflow/hlscustomop.py(711)get_input_datatype()
-> raise Exception("get_input_datatype not implemented for this op")

KeciaHH commented 1 year ago

Hi @fionnodonohoe-xlnx, you can simply remove all code related to TLAST. Remove this line: model = model.transform(InsertTLastMarker(both=True)), as it is not related to this issue. We've tried removing it and the issue still occurs.
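In other words, the relevant change in the modified build_dataflow_steps.py boils down to dropping that one transform. A sketch of what the edited spot might look like (the step name and import path are assumptions based on the traceback above, not the exact file):

    # from finn.transformation.fpgadataflow.insert_tlastmarker import InsertTLastMarker

    def step_insert_tlast(model, cfg):
        # Hypothetical custom step (FINN build steps take (model, cfg) and return the
        # transformed model). The transform below is commented out per the suggestion
        # above: TLastMarker nodes do not implement get_input_datatype(), which later
        # breaks InsertFIFO during bitfile synthesis.
        # model = model.transform(InsertTLastMarker(both=True))
        return model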

fionnodonohoe-xlnx commented 1 year ago

Hi @KokyoK, I went ahead and removed the TLastMarker insertion point. The bitstream failed to generate this time due to a lack of resources on the Pynq FPGA part. I then removed all changes from the modified build_dataflow_steps.py and retried the build, but to no avail. I have attached the DRC report. Since you were able to generate a bitstream for this model, how did you get around this particular resourcing issue? Thanks. top_wrapper_drc_opted.txt

KeciaHH commented 1 year ago

Hi @fionnodonohoe-xlnx, we tried again and did not encounter your problem. I have attached the build_customize folder; we just ran this build again with no errors. This is the attachment: https://drive.google.com/file/d/1yfMkpSVOmBp5GrzpWn62daF1Um2d_-RO/view?usp=sharing
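A general note on the resource overflow mentioned above: a common way to fit a FINN design onto a small part like the Pynq-Z2 is to lower the parallelism, either via a smaller target_fps or a hand-written folding configuration passed to the builder. A hedged sketch; the node names and PE/SIMD values below are illustrative only, not taken from this model:

    import json

    # A FINN folding config maps HLS-layer node names to PE/SIMD (and similar)
    # attributes; smaller values mean less parallelism and fewer resources.
    folding = {
        "Defaults": {},
        "MatrixVectorActivation_0": {"PE": 2, "SIMD": 5},
        "MatrixVectorActivation_1": {"PE": 2, "SIMD": 2},
    }
    with open("folding_config.json", "w") as f:
        json.dump(folding, f, indent=2)

    # The builder can then be pointed at this file instead of target_fps, e.g.:
    # cfg = build_cfg.DataflowBuildConfig(..., folding_config_file="folding_config.json")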