Xilinx / finn-base

Open Source Compiler Framework using ONNX as Frontend and IR
https://finn-base.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Issue with inferring shapes in example model #65

Open jmitrevs opened 2 years ago

jmitrevs commented 2 years ago

If I create an ONNX file with this sample script and input.txt:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# Define simple MLP architecture
class MLP(nn.Module):

    def __init__(self):
        super(MLP, self).__init__()
        # Two layer MLP, ingesting a single frame of BLM data
        self.layer1 = nn.Linear(259, 128)
        self.layer2 = nn.Linear(128, 259*2)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = torch.sigmoid(self.layer2(x))
        return x

# Inference function
def run_inference() -> None:

    # Instantiate the MLP model
    model = MLP()
    # Fix random seed
    np.random.seed(0)

    # Generate weight tensors
    w1 = torch.tensor(np.random.normal(loc=0, scale=0.1, size=(128, 259)).astype(np.single))
    b1 = torch.tensor(np.random.normal(loc=0, scale=0.1, size=128).astype(np.single))

    w2 = torch.tensor(np.random.normal(loc=0, scale=0.1, size=(259*2, 128)).astype(np.single))
    b2 = torch.tensor(np.random.normal(loc=0, scale=0.1, size=259*2).astype(np.single))

    # Single inference step
    with torch.no_grad():

        # Load the fixed weights
        model.layer1.weight = nn.parameter.Parameter(w1)
        model.layer1.bias = nn.parameter.Parameter(b1)

        model.layer2.weight = nn.parameter.Parameter(w2)
        model.layer2.bias = nn.parameter.Parameter(b2)

        # Load the input data and add a batch dimension
        input_data = torch.from_numpy(np.loadtxt('input.txt', dtype=np.single)).unsqueeze(0)

        # Inference
        out = model(input_data)

        # Save in ONNX format
        torch.onnx.export(model,  # model being run
                          input_data,  # model input (or a tuple for multiple inputs)
                          "MLP.onnx")

if __name__ == '__main__':
    run_inference()

(the produced ONNX file is available at: https://drive.google.com/file/d/1wt6ub3cChvPD-XM4-7keuTy5dC5wdVZk/view?usp=sharing)
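
For completeness: input.txt just needs to contain 259 values, matching layer1's input size; a compatible file can be generated with, for example:

import numpy as np

# write 259 random single-precision values, one per line, as np.loadtxt expects
np.savetxt('input.txt', np.random.normal(loc=0, scale=1, size=259).astype(np.single))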

it seems that the infer_shapes step of the cleanup fails:

(fastml) mac-137349:validation jmitrevs$ qonnx-cleanup MLP.onnx 
(fastml) mac-137349:validation jmitrevs$ qonnx-exec MLP_clean.onnx 
Traceback (most recent call last):
  File "/Users/jmitrevs/fastml/bin/qonnx-exec", line 33, in <module>
    sys.exit(load_entry_point('qonnx', 'console_scripts', 'qonnx-exec')())
  File "/Users/jmitrevs/work/qonnx/src/qonnx/util/exec_qonnx.py", line 43, in main
    clize.run(exec_qonnx)
  File "/Users/jmitrevs/fastml/lib/python3.9/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/Users/jmitrevs/fastml/lib/python3.9/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/Users/jmitrevs/fastml/lib/python3.9/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/Users/jmitrevs/work/qonnx/src/qonnx/util/exec_qonnx.py", line 35, in exec_qonnx
    odict = execute_onnx(model, idict)
  File "/Users/jmitrevs/work/finn-base/src/finn/core/onnx_exec.py", line 147, in execute_onnx
    raise Exception("Found unspecified tensor shapes, try infer_shapes")
Exception: Found unspecified tensor shapes, try infer_shapes

The problem is that model.get_tensor_shape('Gemm_0_param0') returns []. I do not understand the behavior.
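
A minimal sketch of the failing check (assuming the qonnx ModelWrapper import path; Gemm_0_param0 is presumably the first Gemm's weight, a (128, 259) initializer):

from qonnx.core.modelwrapper import ModelWrapper

model = ModelWrapper("MLP_clean.onnx")
print(model.get_tensor_shape("Gemm_0_param0"))  # returns [] instead of a real shape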

maltanar commented 2 years ago

Thanks for flagging this, Jovan. I had a quick look at the testcase, and it actually looks like the problem does not come from inside finn-base but rather from onnx.shape_inference.infer_shapes, which we use under the hood to do shape inference for non-custom ops. I was able to reproduce the same problem in a way that sidesteps finn-base completely:

In [1]: from onnx.shape_inference import infer_shapes

In [2]: import onnx

In [3]: ret0=onnx.load("MLP.onnx")

In [4]: ret1=infer_shapes(ret0)

In [5]: onnx.save(ret1, "mlp-with-shapes.onnx")

...and examining mlp-with-shapes.onnx in Netron, I can confirm that the shapes are missing. The good news is that by upgrading to onnx==1.11.0 I was able to get the correct shape inference behavior, so this must be a bug that has been fixed in more recent versions.

I'll re-run the test suite with onnx==1.11.0 and, if it doesn't break anything, I'll push a fix for this to the finn-base and qonnx repos.

jmitrevs commented 2 years ago

It doesn't seem to solve the problem on my Mac. I updated the onnx version but still see the same failure:

(fastml) mac-137349:Downloads jmitrevs$ qonnx-cleanup MLP.onnx 
(fastml) mac-137349:Downloads jmitrevs$ qonnx-exec MLP_clean.onnx 
Traceback (most recent call last):
  File "/Users/jmitrevs/fastml/bin/qonnx-exec", line 33, in <module>
    sys.exit(load_entry_point('qonnx', 'console_scripts', 'qonnx-exec')())
  File "/Users/jmitrevs/work/qonnx/src/qonnx/util/exec_qonnx.py", line 43, in main
    clize.run(exec_qonnx)
  File "/Users/jmitrevs/fastml/lib/python3.9/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/Users/jmitrevs/fastml/lib/python3.9/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/Users/jmitrevs/fastml/lib/python3.9/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/Users/jmitrevs/work/qonnx/src/qonnx/util/exec_qonnx.py", line 35, in exec_qonnx
    odict = execute_onnx(model, idict)
  File "/Users/jmitrevs/work/finn-base/src/finn/core/onnx_exec.py", line 147, in execute_onnx
    raise Exception("Found unspecified tensor shapes, try infer_shapes")
Exception: Found unspecified tensor shapes, try infer_shapes
(fastml) mac-137349:Downloads jmitrevs$ pip list | grep onnx
onnx                          1.11.0
onnxconverter-common          1.8.1
onnxruntime                   1.11.1
qonnx                         0.0.post1.dev104+gc86147e.d20220531 /Users/jmitrevs/work/qonnx/src
tf2onnx                       1.10.0                              /Users/jmitrevs/work/tensorflow-onnx

maltanar commented 2 years ago

I had only used Netron to check that the shapes appeared for the intermediate tensors, but if I use qonnx-exec I actually see the same problem. The root cause seems to be the following: even though the weight and bias tensors for the Gemm nodes have initializers, no ValueInfo is generated for these tensors during shape inference. Since we rely on ValueInfo to get shape information, the "Found unspecified tensor shapes" exception is thrown during execution.
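
This can be verified with plain onnx, independent of finn-base; a small sketch that lists the initializers for which shape inference produced no ValueInfo:

import onnx
from onnx.shape_inference import infer_shapes

model = infer_shapes(onnx.load("MLP.onnx"))
graph = model.graph
# tensors that carry shape information: graph inputs, outputs and inferred value_info
known = {vi.name for vi in list(graph.input) + list(graph.output) + list(graph.value_info)}
# initializers (the Gemm weights and biases) left without any ValueInfo entry
print([init.name for init in graph.initializer if init.name not in known])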

It looks like this issue has been around for a while and is related to initializers not being listed as inputs: https://github.com/onnx/onnx/issues/4102 and https://github.com/onnx/onnx/issues/2874. However, the following merged PR was supposed to fix this for 1.11.0 and later: https://github.com/onnx/onnx/pull/2901

I'm not entirely sure why the fix hasn't kicked in here. I'll have a closer look.

maltanar commented 2 years ago

I haven't been able to find out why ONNX PR #2901 does not solve this issue, so I just added a workaround in ModelWrapper that applies a fix while loading the model.

Since finn-base is scheduled to be sunset, I did this directly in a new qonnx branch: https://github.com/fastmachinelearning/qonnx/tree/feature/finn_base_migration
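
For illustration (this is not necessarily what the branch does): one way to patch such a model at load time is to synthesize ValueInfo entries from the initializers themselves, since each initializer already carries its dtype and dims:

import onnx
from onnx import helper

model = onnx.load("MLP_clean.onnx")
graph = model.graph
known = {vi.name for vi in list(graph.input) + list(graph.output) + list(graph.value_info)}
for init in graph.initializer:
    if init.name not in known:
        # build a ValueInfoProto from the initializer's own dtype and dims
        graph.value_info.append(helper.make_tensor_value_info(init.name, init.data_type, list(init.dims)))
onnx.save(model, "MLP_clean_patched.onnx")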

@jmitrevs could you give this a try and see if it resolves the issue for you? I was able to run qonnx-cleanup and qonnx-exec without errors on the MLP.onnx you shared.

jmitrevs commented 2 years ago

I believe it fixed the problem. I am now running into another issue, but I think it's unrelated. (I will double-check this afternoon.)

jmitrevs commented 2 years ago

I confirmed that my script now works (after fixing an unrelated bug).

Tayyar commented 2 years ago

@maltanar What's the fix for this issue when using the latest finn-base dev branch? (I tried building a Docker image with onnx>=1.11.0, but it didn't fix the issue.)