Run Deeplabv3 (resnet 101 backbone) from pytorch framework on Xilinx at good performance metric

Xilinx / pyxir

Apache License 2.0

Run Deeplabv3 (resnet 101 backbone) from pytorch framework on Xilinx at good performance metric #40

Open abdulazizm opened 3 years ago

abdulazizm commented 3 years ago

I am trying to build DeepLabv3 from PyTorch and deploy it on a Xilinx edge device (ZCU104), but I am facing some issues while quantizing. To get good metrics, as suggested in https://github.com/Xilinx/pyxir/issues/33#issuecomment-829961279, I updated the model with these changes:

Tried with padding and dilation = (2,2) instead of default (4,4), (12,12), (24,24), (36,36)
Removed dropouts
Adjusted conv2d layers to satisfy the DPU constraint (kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth/2). The ResNet-101 backbone has conv2d layers with 2048 channels (not supported on the DPU), so I edited the final layer to have at most 1024 channels, which is eligible to run on the DPU.

None of these changes seems to affect quantization/compilation, except changing the padding and dilation from (4,4) to (2,2), which is important for good inference metrics.
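The DPU constraint quoted in the list above can be checked per layer before compiling. The following is a minimal sketch, not part of pyxir: the helper name `conv_fits_dpu_bank` is hypothetical, and the defaults `channel_parallel=16` and `bank_depth=2048` are assumed example values that should be replaced with the numbers documented for the DPU configuration actually in use.

```python
import math

def conv_fits_dpu_bank(kernel_w, kernel_h, input_channel,
                       channel_parallel=16, bank_depth=2048):
    """Check kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth / 2.

    channel_parallel and bank_depth are assumed example values, not read
    from any DPU configuration file.
    """
    words = kernel_w * kernel_h * math.ceil(input_channel / channel_parallel)
    return words <= bank_depth / 2

# 3x3 convolution with 2048 input channels: 9 * 128 = 1152 > 1024
print(conv_fits_dpu_bank(3, 3, 2048))  # False under the assumed values
# 3x3 convolution with 1024 input channels: 9 * 64 = 576 <= 1024
print(conv_fits_dpu_bank(3, 3, 1024))  # True under the assumed values
```

Under these assumed values, a 3x3 convolution over 2048 input channels violates the constraint while the same kernel over 1024 channels satisfies it, which is consistent with the change described above.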

Attached mod['main'] after mergecompiler -> mod_main_after_mergecompiler.txt

**************************************************
* START GRAPH COMPILATION FOR TARGET: DPUCZDX8G-zcu104
**************************************************
INFO:pyxir:Command:
        dnnc-dpuv2 --parser tensorflow            --frozen_pb /tmp/tmpfstowlv_/deploy_model.pb             --cpu_arch arm64             --output_dir /tmp/tmpfstowlv_/             --net_name xp0             --dcf /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/contrib/target/components/DPUCZDX8G/./ZCU104.dcf

INFO:pyxir:Output:
Kernel topology "xp0_kernel_graph.jpg" for network "xp0"
kernel list info for network "xp0"
                               Kernel ID : Name
                                       0 : xp0

                             Kernel Name : xp0
--------------------------------------------------------------------------------
                             Kernel Type : DPUKernel
                               Code Size : 2.28MB
                              Param Size : 46.23MB
                           Workload MACs : 77307.97MOPS
                         IO Memory Space : 3.69MB
                              Mean Value : 0, 0, 0,
                      Total Tensor Count : 111
                Boundary Input Tensor(s)   (H*W*C)
                            xinput0:0(0) : 224*224*3

               Boundary Output Tensor(s)   (H*W*C)
             nn_relu_93879679120560:0(0) : 28*28*256
             nn_relu_93879463326768:0(1) : 28*28*256
             nn_relu_93879463325248:0(2) : 28*28*1024
             nn_relu_93879463331328:0(3) : 28*28*256
             nn_relu_93879463329808:0(4) : 28*28*256
             nn_relu_93879463328288:0(5) : 28*28*256

                        Total Node Count : 110
                           Input Node(s)   (H*W*C)
        nn_conv2d_93879632672880_Conv(0) : 224*224*3

                          Output Node(s)   (H*W*C)
        nn_conv2d_93879463325632_Conv(0) : 28*28*256
               nn_relu_93879463325248(0) : 28*28*1024
        nn_conv2d_93879463330192_Conv(0) : 28*28*256
        nn_conv2d_93879463328672_Conv(0) : 28*28*256
        nn_conv2d_93879463327152_Conv(0) : 28*28*256
        nn_conv2d_93879679119424_Conv(0) : 28*28*256

INFO:pyxir:Output names: ['nn.relu-93879463329808', 'nn.relu-93879463328288', 'nn.relu-93879463331328', 'nn.relu-93879463326768', 'nn.relu-93879463325248', 'nn.relu-93879679120560']
Traceback (most recent call last):
  File "compile_pytorch_deeplab.py", line 321, in <module>
    InferenceSession.run()
  File "/workspace/python/tvm/contrib/graph_executor.py", line 206, in run
    self._run()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 322, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 257, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 246, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 160, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: AssertionError: Can't retrieve right out tensor names from DNNC compiler output
At:
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/contrib/target/components/DPUCZDX8G/vai_c.py(164): compile
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/contrib/target/components/DPUCZDX8G/zcu104.py(81): xgraph_dpu_zcu104_compiler
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/base.py(156): compile
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/base.py(208): compile_opaque_func
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/opaque_func.py(113): opaque_func_wrapper
  /workspace/python/tvm/contrib/graph_executor.py(206): run
  compile_pytorch_deeplab.py(321): <module>

@jornt-xilinx Need your support. Created a new issue here for easy tracking.

jtuyls commented 3 years ago

Hi @abdulazizm , it looks like the issue is that the . in the output name nn.relu-93879463329808 is not matched with the _ in the DNNC output: nn_relu_93879463329808. The fallback uses output shape matching but as some shapes have the same size, that also fails. I added a fix to the dev branch: https://github.com/Xilinx/pyxir/blob/52d7e7c08e5a143003e87041e00b14481f568b45/python/pyxir/contrib/target/components/DPUCZDX8G/dnnc_output.py#L163 Could you try whether it works for you?
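To illustrate the mismatch, here is a small sketch of name normalization. It is not the actual fix in dnnc_output.py; the helpers `to_dnnc_name` and `match_outputs` are hypothetical and only assume that DNNC replaces every character other than letters, digits and underscores with an underscore.

```python
import re

def to_dnnc_name(name):
    """Replace every character other than letters, digits and '_' with '_'."""
    return re.sub(r"[^0-9A-Za-z_]", "_", name)

def match_outputs(pyxir_names, dnnc_names):
    """Map each pyxir output name to the DNNC tensor name it normalizes to."""
    dnnc = set(dnnc_names)
    return {n: to_dnnc_name(n) for n in pyxir_names if to_dnnc_name(n) in dnnc}

print(match_outputs(["nn.relu-93879463329808"], ["nn_relu_93879463329808"]))
```

Matching by normalized name avoids the shape-based fallback, which is ambiguous here because several outputs share the shape 28*28*256.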

abdulazizm commented 3 years ago

@jtuyls Hope that's a good catch, not getting an assertion error now. But there seem to be some other naming convention issues too. Let me know if I missed anything.

INFO:pyxir:Output names: ['nn.relu-94897781658768', 'nn.relu-94897781657248', 'nn.relu-94897781654208', 'nn.relu-94897781652688', 'nn.relu-94897781655728', 'nn.relu-94897371188960']
Traceback (most recent call last):
  File "compile_pytorch_deeplab.py", line 326, in <module>
    InferenceSession.run()
  File "/workspace/python/tvm/contrib/graph_executor.py", line 206, in run
    self._run()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 322, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 257, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 246, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 160, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: NameError: name 're' is not defined
At:
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/contrib/target/components/DPUCZDX8G/dnnc_output.py(162): get_dnnc_str
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/contrib/target/components/DPUCZDX8G/vai_c.py(158): compile
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/contrib/target/components/DPUCZDX8G/zcu104.py(81): xgraph_dpu_zcu104_compiler
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/base.py(156): compile
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/base.py(208): compile_opaque_func
  /home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/opaque_func.py(113): opaque_func_wrapper
  /workspace/python/tvm/contrib/graph_executor.py(206): run
  compile_pytorch_deeplab.py(326): <module>
(vitis-ai-tensorflow) Vitis-AI /workspace/deeplab_code >

Thanks!!

jtuyls commented 3 years ago

@abdulazizm It looks like the regular expressions package re has not been imported. If you copied the line, you might have missed the import? https://github.com/Xilinx/pyxir/blob/52d7e7c08e5a143003e87041e00b14481f568b45/python/pyxir/contrib/target/components/DPUCZDX8G/dnnc_output.py#L18

abdulazizm commented 3 years ago

@jtuyls Thanks for the quick reply. Yeah, I missed copying import into my codebase. Now it builds and exports the library successfully. Will check this on the EDGE device and share the inference metrics with you shortly.

I am also trying to benchmark the results. Could you please let me know if there is any special API to calculate FPS, latency, throughput, or other metrics (from TVM or PyXIR)? Currently, I measure the time taken for each inference as the time difference before and after the InferenceSession.run() call.

Thanks, Abdul.

abdulazizm commented 3 years ago

@jtuyls Happy to share that inference now takes 0.35 s (earlier it was 3.5 s), which is a drastic improvement from a performance point of view. Thanks for the support. Do you have any other suggestions to improve performance further, such as fine-tuning?

root@pynq:/home/xilinx# python3 run_pytorch_deeplab.py
image shape [224, 224]
========================================
Inference time: 346.09 ms | 0.35 s
========================================

Reiterating from my previous comment: "I am also trying to benchmark the results. Could you please let me know if there is any special API to calculate FPS, latency, throughput, or other metrics (from TVM or PyXIR)? Currently, I measure the time taken for each inference as the time difference before and after the InferenceSession.run() call."

jtuyls commented 3 years ago

@abdulazizm Great to hear that inference is now 10x faster than before! Still, unless there are some expensive operations that you expect will have to be executed on the CPU, I think it might be possible to improve further. For this, I would look at which parts of the model are offloaded to the DPU and which stay on the CPU. You can check this by inspecting the TVM module after the PartitionGraph transformation:

from tvm import relay
from tvm.relay.op.contrib.vitis_ai import annotation

mod = annotation(mod, params, dpu_target)
mod = relay.transform.MergeCompilerRegions()(mod)
mod = relay.transform.PartitionGraph()(mod)
print(mod['main']) # -> main TVM function containing all CPU ops and a call to the Vitis AI function (vitis_ai_0)
print(mod['vitis_ai_0']) # -> Function inside TVM containing all operations that will be offloaded to the DPU

mod['vitis_ai_0'] -> will show you all operations that will be executed on the DPU.

mod['main'] -> this is the main TVM function, all operations in here will be executed on the CPU. You should see a call to the vitis_ai_0 function from above in here. As an example, for the MXNet Resnet 18 model this function looks like this:

fn (%data: Tensor[(1, 3, 224, 224), float32]) -> Tensor[(1, 1000), float32] {
%0 = layout_transform(%data, src_layout="NCHW", dst_layout="NHWC") /* ty=Tensor[(1, 224, 224, 3), float32] */;
%1 = @vitis_ai_0(%0) /* ty=Tensor[(1, 1, 1, 512), float32] */;
%2 = layout_transform(%1, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 512, 1, 1), float32] */;
%3 = nn.batch_flatten(%2) /* ty=Tensor[(1, 512), float32] */;
%4 = nn.dense(%3, meta[relay.Constant][0] /* ty=Tensor[(1000, 512), float32] */, units=1000) /* ty=Tensor[(1, 1000), float32] */;
add(%4, meta[relay.Constant][1] /* ty=Tensor[(1000), float32] */) /* ty=Tensor[(1, 1000), float32] */
}

If this main function still contains some expensive operations (conv2d), it might be possible to optimize further by trying to include those operations inside the DPU.
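One quick way to spot expensive operations left on the CPU is to count the operator calls in the printed main function. The sketch below works on the Relay text only; `count_cpu_ops` is a hypothetical helper, and it assumes the text format shown in the example above (one call per line, optionally prefixed by `%N = `).

```python
import re
from collections import Counter

# Matches an operator call at the start of a Relay text line, with or
# without a leading "%N = ". Calls to global functions (starting with "@")
# and tuple lines do not match and are therefore skipped.
_OP_CALL = re.compile(r"^\s*(?:%\d+\s*=\s*)?([A-Za-z_][\w.]*)\(", re.MULTILINE)

def count_cpu_ops(main_text):
    """Count operator names appearing in the printed main function."""
    return Counter(_OP_CALL.findall(main_text))

example = """fn (%data: Tensor[(1, 3, 224, 224), float32]) {
  %0 = layout_transform(%data, src_layout="NCHW", dst_layout="NHWC");
  %1 = @vitis_ai_0(%0);
  %2 = nn.conv2d(%1, meta[relay.Constant][0]);
  nn.relu(%2)
}"""
counts = count_cpu_ops(example)
print(counts["nn.conv2d"])
```

In a real script the input would presumably be something like `str(mod['main'])`; a non-zero count for nn.conv2d then points at convolutions that were not offloaded.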

"I am also trying to benchmark the results. Could you please let me know if there is any special API to calculate FPS, latency, throughput, or other metrics (from TVM or PyXIR)? Currently, I measure the time taken for each inference as the time difference before and after the InferenceSession.run() call."

We usually calculate latency the same way for experimentation. If you want to really benchmark I would use the TVM time evaluator function

You can use it like this:

import numpy as np
import tvm
from tvm.contrib import graph_executor

...
mod = graph_executor.GraphModule(lib["default"](tvm.cpu()))
mod.set_input(...)

ftimer = mod.module.time_evaluator("run", tvm.cpu(), number=10, repeat=100)
prof_res = np.array(ftimer().results) * 1000
print("%-20s %-19s (%s)" % ("TVM Runtime:", "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res)))

For FPS, we calculate it as number of inputs/total time with a high enough number of inputs (>1000). I am not aware of a utility function like the time_evaluator above for this, but you can do something similar as that function with for example a warm up run.
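The FPS calculation described above can be sketched as follows. This is not a TVM or PyXIR utility; `measure_fps` and its parameters are hypothetical, and `run_once` stands for any zero-argument callable that performs one inference.

```python
import time

def measure_fps(run_once, num_inputs=1000, warmup=10):
    """Return inputs per second for run_once, a zero-argument callable."""
    for _ in range(warmup):          # warm-up runs are not timed
        run_once()
    start = time.perf_counter()
    for _ in range(num_inputs):      # timed runs
        run_once()
    elapsed = time.perf_counter() - start
    return num_inputs / elapsed

# Dummy workload standing in for one inference call.
calls = []
fps = measure_fps(lambda: calls.append(1), num_inputs=1000, warmup=10)
print("%.1f FPS over %d timed runs" % (fps, len(calls) - 10))
```

With a TVM graph executor module, `run_once` would presumably wrap the module's run call after the input has been set. Note that this measures single-stream throughput only; feeding the DPU from several threads would need a different setup.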

abdulazizm commented 3 years ago

@jtuyls Thanks for the very detailed reply. I was able to use the TVM time evaluator function, which helps a lot. I also used mod['main'] to find some layers running on the CPU. After updating the model (removed the dropout layer, changed the torch.nn.AdaptiveAvgPool2d layer to a torch.nn.AvgPool2d layer), inference now takes 234.72 ms (from the TVM time evaluator). Great one!!

Still, I can see 3 conv2d layers running on the CPU. I checked them against the DPU constraints and they seem to be fine, so I am not sure why they aren't offloaded to the DPU. Any suggestions?

mod['main'] - output

fn (%data: Tensor[(1, 3, 224, 224), float32]) -> (Tensor[(1, 21, 224, 224), float32], Tensor[(1, 21, 224, 224), float32]) {
  %0 = layout_transform(%data, src_layout="NCHW", dst_layout="NHWC") /* ty=Tensor[(1, 224, 224, 3), float32] */;
  %1 = @vitis_ai_0(%0) /* ty=(Tensor[(1, 28, 28, 21), float32], Tensor[(1, 28, 28, 256), float32], Tensor[(1, 28, 28, 256), float32], Tensor[(1, 28, 28, 256), float32], Tensor[(1, 28, 28, 256), float32], Tensor[(1, 27, 27, 256), float32]) */;
  %2 = %1.0;
  %3 = expand_dims(meta[relay.Constant][0] /* ty=Tensor[(21), float32] */, axis=1, num_newaxis=2) /* ty=Tensor[(21, 1, 1), float32] */;
  %4 = expand_dims(%3, axis=0) /* ty=Tensor[(1, 21, 1, 1), float32] */;
  %5 = layout_transform(%4, src_layout="NCHW", dst_layout="NHWC") /* ty=Tensor[(1, 1, 1, 21), float32] */;
  %6 = add(%2, %5) /* ty=Tensor[(1, 28, 28, 21), float32] */;
  %7 = layout_transform(%6, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 21, 28, 28), float32] */;
  %8 = image.resize(%7, size=[224, 224]) /* ty=Tensor[(1, 21, 224, 224), float32] */;
  %9 = %1.1;
  %10 = layout_transform(%9, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 256, 28, 28), float32] */;
  %11 = %1.2;
  %12 = layout_transform(%11, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 256, 28, 28), float32] */;
  %13 = %1.3;
  %14 = layout_transform(%13, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 256, 28, 28), float32] */;
  %15 = %1.4;
  %16 = layout_transform(%15, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 256, 28, 28), float32] */;
  %17 = %1.5;
  %18 = layout_transform(%17, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 256, 27, 27), float32] */;
  %19 = image.resize(%18, size=[28, 28]) /* ty=Tensor[(1, 256, 28, 28), float32] */;
  %20 = (%10, %12, %14, %16, %19);
  %21 = concatenate(%20, axis=1) /* ty=Tensor[(1, 1280, 28, 28), float32] */;
  %22 = layout_transform(%21, src_layout="NCHW", dst_layout="NHWC") /* ty=Tensor[(1, 28, 28, 1280), float32] */;
  %23 = nn.conv2d(%22, meta[relay.Constant][1] /* ty=Tensor[(256, 1280, 1, 1), float32] */, padding=[0, 0, 0, 0], channels=256, kernel_size=[1, 1], data_layout="NHWC") /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %24 = nn.batch_norm(%23, meta[relay.Constant][2] /* ty=Tensor[(256), float32] */, meta[relay.Constant][3] /* ty=Tensor[(256), float32] */, meta[relay.Constant][4] /* ty=Tensor[(256), float32] */, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */, axis=3) /* ty=(Tensor[(1, 28, 28, 256), float32], Tensor[(256), float32], Tensor[(256), float32]) */;
  %25 = %24.0;
  %26 = nn.relu(%25) /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %27 = nn.conv2d(%26, meta[relay.Constant][6] /* ty=Tensor[(256, 256, 3, 3), float32] */, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3], data_layout="NHWC") /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %28 = nn.batch_norm(%27, meta[relay.Constant][7] /* ty=Tensor[(256), float32] */, meta[relay.Constant][8] /* ty=Tensor[(256), float32] */, meta[relay.Constant][9] /* ty=Tensor[(256), float32] */, meta[relay.Constant][10] /* ty=Tensor[(256), float32] */, axis=3) /* ty=(Tensor[(1, 28, 28, 256), float32], Tensor[(256), float32], Tensor[(256), float32]) */;
  %29 = %28.0;
  %30 = nn.relu(%29) /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %31 = nn.conv2d(%30, meta[relay.Constant][11] /* ty=Tensor[(21, 256, 1, 1), float32] */, padding=[0, 0, 0, 0], channels=21, kernel_size=[1, 1], data_layout="NHWC") /* ty=Tensor[(1, 28, 28, 21), float32] */;
  %32 = expand_dims(meta[relay.Constant][12] /* ty=Tensor[(21), float32] */, axis=1, num_newaxis=2) /* ty=Tensor[(21, 1, 1), float32] */;
  %33 = expand_dims(%32, axis=0) /* ty=Tensor[(1, 21, 1, 1), float32] */;
  %34 = layout_transform(%33, src_layout="NCHW", dst_layout="NHWC") /* ty=Tensor[(1, 1, 1, 21), float32] */;
  %35 = add(%31, %34) /* ty=Tensor[(1, 28, 28, 21), float32] */;
  %36 = layout_transform(%35, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 21, 28, 28), float32] */;
  %37 = image.resize(%36, size=[224, 224]) /* ty=Tensor[(1, 21, 224, 224), float32] */;
  (%8, %37)
}

Hope we can merge the changes in "pyxir/python/pyxir/contrib/target/components/DPUCZDX8G/dnnc_output.py" file to the master branch.

jtuyls commented 3 years ago

@abdulazizm Getting the remaining 3 conv2d operations into the DPU should indeed give you another performance improvement. I think the image.resize operation is the culprit, as it's not converted to an NHWC operation and is therefore surrounded by layout_transform operations in the layout transformation pass. I think it will be included in the vitis_ai_0 function if you add the following snippet to your script to register image.resize as an operation for which the layout can be transformed:

from tvm import relay
from tvm.relay import reg
@reg.register_convert_op_layout("image.resize")
def convert_image_resize(attrs, inputs, tinfos, desired_layouts):
    data = inputs
    new_attrs = dict(attrs)
    new_attrs['layout'] = 'NHWC'
    return relay.image.resize(data[0], **new_attrs)

And then you will also have to add image.resize to the layout transformation pass to make sure that the image.resize operation will be executed in NHWC and therefore can be included in the vitis_ai_0 function as well.

desired_layouts = {'nn.conv2d': ['NHWC', 'default'], 'image.resize': ['NHWC']}

seq = tvm.transform.Sequential([relay.transform.RemoveUnusedFunctions(),
                                relay.transform.ConvertLayout(desired_layouts),
                                relay.transform.FoldConstant()])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)

abdulazizm commented 3 years ago

@jtuyls Thanks for the reply, Jorn. I was using the desired layout for nn.conv2d as 'OIHW' instead of 'default'. Tried with the suggestions for image.resize, but still getting the same mod['main'] and inference time (image.resize at mod['main']). Not sure why, are we missing something?

1, desired_layouts = {'nn.conv2d': ['NHWC', 'default']} -> default recommendation (has 157 entries in mod['main']) and also yields assertion error from tvm

  File "/home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/graph/ops/l2_convolution.py", line 182, in conv2d
    assert "Constant" in weights_layer.type
TVMError: AssertionError

2, desired_layouts = {'nn.conv2d': ['NHWC', 'OIHW']} -> was using this before your suggestion regarding resize, this moves 120 layers inside vitis_ai_0 function (has only 37 entries in mod['main']) - Hope this should be ok

3, desired_layouts = {'nn.conv2d': ['NHWC', 'default'], 'image.resize': ['NHWC']} -> this results the same mod['main'] and assertion error as (1)

4, desired_layouts = {'nn.conv2d': ['NHWC', 'OIHW'], 'image.resize': ['NHWC']} -> this results the same mod['main'] as (2)

FYI: I am not using the current master branch of Pyxir and TVM. I am in "dev-rf-test-0" pyxir branch commit (485b7c1dfcfa863cdb715a0ba6125b10499e58e2). Hope this should not be an issue.

jtuyls commented 3 years ago

@abdulazizm Using 'OIHW' is fine. With 'default' you will probably have to add the line mod["main"] = bind_params_by_name(mod["main"], params) before the transformation to avoid transposes on the weight parameters (we are looking at resolving these kinds of issues and abstracting all these transformations away inside one partitioning function).

So, approach 4 desired_layouts = {'nn.conv2d': ['NHWC', 'OIHW'], 'image.resize': ['NHWC']} is what I expected would work. This didn't change anything in the resulting TVM module? And you still see image resize operation in NCHW layout instead of NHWC?

Before, image.resize looked like this:

%8 = image.resize(%7, size=[224, 224]) /* ty=Tensor[(1, 21, 224, 224), float32] */;

Even if it is not included in the vitis_ai_0 function, the image.resize operation should at least have NHWC layout after the transformation:

%8 = image.resize(%7, size=[224, 224]) /* ty=Tensor[(1,224, 224, 21), float32] */;

If this is not the case, then there is an issue with the NCHW -> NHWC transformation.

abdulazizm commented 3 years ago

@jtuyls Yes, mod['main'] seems to be the same before and after the image.resize change. I also noticed that the convert_image_resize() function is never called during compilation (is it not registered properly?). Looking into the TVM code, it seems that it has 'NCHW' as the default layout (would rebuilding TVM with "NHWC" as the default layout help?).

Guess you are right, there is an issue with the NCHW -> NHWC transformation.

jtuyls commented 3 years ago

@abdulazizm Found out that we also needed to add an InferCorrectLayout function and created a TVM PR for this: https://github.com/apache/tvm/pull/8205

abdulazizm commented 3 years ago

@jtuyls Yeah sure, Jorn. Thanks for creating a PR and pushing the feature. Please let me know once it's merged to the master branch; I will give it a shot and let you know the inference benchmark. I just started exploring the petalinux workflow and will update if I get stuck somewhere.

abdulazizm commented 3 years ago

@jtuyls Hi Jorn, this is on the petalinux-based setup. It seems the deeplabv3 model has 3 subgraphs. How can I load a model with more than one DPU subgraph? The output is as follows:

root@xilinx-zcu104-2020_2:~/workspace/scripts/deeplabv3_resnet101# python3 run_pytorch_deeplabv3_resnet101_zynq_fps.py -f "" -t 3 --nb_tvm_threads 1
File /home/root/.tvm_test_data/data/cat.png exists, skip.
File /home/root/.tvm_test_data/data/imagenet1000_clsid_to_human.txt exists, skip.
transform_image_torchvision_NCHW torch.Size([1, 3, 224, 224])
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0614 08:36:17.361477  2907 dpu_func.cpp:57] Check failed: subgraph_.size() == 1u (3 vs. 1) model should have one and only one dpu subgraph.
*** Check failure stack trace: ***
Aborted
root@xilinx-zcu104-2020_2:~/workspace/scripts/deeplabv3_resnet101#

Noticed a workaround here for C++ with VART (https://github.com/Xilinx/Vitis-AI/issues/153). Not sure how to get it done with Python and TVM.

abdulazizm commented 3 years ago

Hi @jornt-xilinx ,

Tested your suggestion of changing the image.resize2d layout (NCHW to NHWC) with recent TVM and PyXIR versions. It eliminated most layout_transforms in mod["main"], but the 3 conv2d layers still did not move into mod['vitis_ai_0']. Are there any further suggestions w.r.t. these conv2d layers?

mod["main"]

fn (%data: Tensor[(1, 3, 224, 224), float32]) -> (Tensor[(1, 21, 224, 224), float32], Tensor[(1, 21, 224, 224), float32]) {
  %0 = layout_transform(%data, src_layout="NCHW", dst_layout="NHWC") /* ty=Tensor[(1, 224, 224, 3), float32] */;
  %1 = @tvmgen_default_vitis_ai_0(%0) /* ty=(Tensor[(1, 28, 28, 21), float32], Tensor[(1, 28, 28, 256), float32], Tensor[(1, 28, 28, 256), float32], Tensor[(1, 28, 28, 256), float32], Tensor[(1, 28, 28, 256), float32], Tensor[(1, 27, 27, 256), float32]) */;
  %2 = %1.0;
  %3 = image.resize2d(%2, size=[224, 224], layout="NHWC", rounding_method="") /* ty=Tensor[(1, 224, 224, 21), float32] */;
  %4 = %1.5;
  %5 = %1.1;
  %6 = %1.2;
  %7 = %1.3;
  %8 = %1.4;
  %9 = image.resize2d(%4, size=[28, 28], layout="NHWC", rounding_method="") /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %10 = (%5, %6, %7, %8, %9);
  %11 = concatenate(%10, axis=3) /* ty=Tensor[(1, 28, 28, 1280), float32] */;
  %12 = nn.conv2d(%11, meta[relay.Constant][0] /* ty=Tensor[(1, 1, 1280, 256), float32] */, padding=[0, 0, 0, 0], channels=256, kernel_size=[1, 1], data_layout="NHWC", kernel_layout="HWIO") /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %13 = nn.batch_norm(%12, meta[relay.Constant][1] /* ty=Tensor[(256), float32] */, meta[relay.Constant][2] /* ty=Tensor[(256), float32] */, meta[relay.Constant][3] /* ty=Tensor[(256), float32] */, meta[relay.Constant][4] /* ty=Tensor[(256), float32] */, axis=3) /* ty=(Tensor[(1, 28, 28, 256), float32], Tensor[(256), float32], Tensor[(256), float32]) */;
  %14 = %13.0;
  %15 = nn.relu(%14) /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %16 = nn.conv2d(%15, meta[relay.Constant][5] /* ty=Tensor[(3, 3, 256, 256), float32] */, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3], data_layout="NHWC", kernel_layout="HWIO") /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %17 = nn.batch_norm(%16, meta[relay.Constant][6] /* ty=Tensor[(256), float32] */, meta[relay.Constant][7] /* ty=Tensor[(256), float32] */, meta[relay.Constant][8] /* ty=Tensor[(256), float32] */, meta[relay.Constant][9] /* ty=Tensor[(256), float32] */, axis=3) /* ty=(Tensor[(1, 28, 28, 256), float32], Tensor[(256), float32], Tensor[(256), float32]) */;
  %18 = %17.0;
  %19 = nn.relu(%18) /* ty=Tensor[(1, 28, 28, 256), float32] */;
  %20 = nn.conv2d(%19, meta[relay.Constant][10] /* ty=Tensor[(1, 1, 256, 21), float32] */, padding=[0, 0, 0, 0], channels=21, kernel_size=[1, 1], data_layout="NHWC", kernel_layout="HWIO") /* ty=Tensor[(1, 28, 28, 21), float32] */;
  %21 = add(%20, meta[relay.Constant][11] /* ty=Tensor[(1, 1, 1, 21), float32] */) /* ty=Tensor[(1, 28, 28, 21), float32] */;
  %22 = image.resize2d(%21, size=[224, 224], layout="NHWC", rounding_method="") /* ty=Tensor[(1, 224, 224, 21), float32] */;
  %23 = layout_transform(%3, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 21, 224, 224), float32] */;
  %24 = layout_transform(%22, src_layout="NHWC", dst_layout="NCHW") /* ty=Tensor[(1, 21, 224, 224), float32] */;
  (%23, %24)
}

jornt-xilinx commented 3 years ago

@abdulazizm I think those three Conv2D operations are not included in the Vitis AI partition because the image.resize operation has been renamed recently to image.resize2d, which isn't recognized here. I created a PR: https://github.com/Xilinx/pyxir/pull/59 to add support for the image.resize2d operation. Could you try if it allows you to include all Conv2D operations into mod['vitis_ai_0']?