Temerson0 opened 2 years ago
Here is the output of the vai_c compile just before the error:
```
CODE add_10/add conv2d-eltwise
Namespace(absolutely=4, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=True, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution=True, depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape=True, forwardcut=None, framework='caffe', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline=True, maxpoolconcatissues=True, network=None, nocpu=True, operation_fusion=True, operation_fusion_elt=True, operation_fusion_pool_conv=True, optimalavgpool=False, output=None, parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params='param', prefetch=True, quant=None, skip=False, softwarepipeline=True, uppy_fusion=True)
BATCH IN Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7]
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
A
0
0
0
the memory allocation MUST BE DONE BETTER RESHAPE
Name FeatureMapBuffer Size 20971520
Banks [
Name 0 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
Name 1 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
Name 2 Layout 2 Unit 8-bits Frequency 300MHz Rows 4096 Column 128 dict_keys([])
]
Name conv2d_38/Conv2D Type conv2d Composed []
Pred ['conv2d_37/Conv2D']
Succ ['conv2d_40/Conv2D', 'conv2d_39/Conv2D']
```
@Temerson0 would you be willing to share the model artifacts? Frozen pb, quantized model.
@woinck could you please take a look at this with Paolo when you have time?
@bryanloz-xilinx @woinck I can share the artifacts. What is the best way to send you these? I've used EZMove in the past.
EZMove still works, you can send to me, and I can make sure the right people get it.
bryanloz@xilinx.com
OK, let me take a look at the model.
This is a little awkward because I can compile it in isolation using 2.0 (the compiler alone):
```
LOADFM 3602 221825024 14736.32921810776
LOADW 372 48732768 3133.5802469135906
LOADTH 0 0 0
SAVE 2944 173801472 13872.460905350283
CONV 918 256020316160 54219.216494845175
ELEW 1024 29097984 3190.5684210526074
POOL 0 0 0
Total Time = 65321.44 us (FPS = 244.94)
```
I know this will raise more questions, but HW people create awesome plots.
This is to show that the computation is done and verified in HW. But let me see if I can take a look at the layer failing for you ...
```
CODE add_9/add conv2d-eltwise
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7]
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
A
0
0
0
CODE add_10/add conv2d-eltwise
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7]
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
A
0
0
0
CODE conv2d_86/Conv2D conv2d
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN Shape [4, 64, 64, 256] Heights [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
0
0
representative 2 valid True
```
The memory failure for you @Temerson0 should stop at a pdb trace (looking at the code and the assertion), but your memory seems "free".
I wonder how I can reproduce the error ... it may be somewhere else (not the same layer).
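The "stop at a pdb trace" behaviour described above can be sketched as follows. This is an illustrative assumption, not the compiler's actual code: `checked_allocate` is a hypothetical wrapper mirroring the `FM.allocate` call from the issue.

```python
import pdb

def checked_allocate(fm, nbytes):
    # Hypothetical guard around an allocator call: when the allocation
    # fails (returns None), drop into the debugger so the allocator
    # state can be inspected interactively.
    tq = fm.allocate(nbytes)
    if tq is None:
        pdb.set_trace()  # stop here, as the assertion/trace in the compiler would
    return tq
```

The point is that a `None` return is trapped at the call site instead of propagating into later code, which matches the "should stop at a pdb trace" description.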
@Temerson0 what version of Vitis-AI did you use when you hit this error? It is possible that it was accidentally fixed when we released Vitis-AI 2.0.
Sorry, I now see the comment above where you tried it with both.
There is still an opportunity for us to reproduce the error. Let me see what we can do.
@bryanloz-xilinx @paolodalberto Yes, I did try both versions (as you saw). Did you follow the compile instructions that I used with DPUCADF8H and U200? The input shape was also described as 'input_shape':'4,512,512,3'. I did try a compile with DPUCZDX8G and ZCU102 with 'input_shape':'1,512,512,3', which did complete, but it didn't give me any of the similar output of the DPUCADF8H. So I'm not sure why the output of the two compiles is different.
@bryanloz-xilinx @paolodalberto Here is the output of the DPUCZDX8G compile, notice that this is all the output, where the DPUCADF8H had extensive output about each layer:
```
bash compile_yolov4.sh
**************************************************
* VITIS_AI Compilation - Xilinx Inc.
**************************************************
[INFO] Namespace(batchsize=1, inputs_shape=['1,512,512,3'], layout='NHWC', model_files=['./eai_yolov4_quantized/quantize_eval_model.pb'], model_type='tensorflow', named_inputs_shape=None, out_filename='/tmp/dpu_yolov4_org.xmodel', proto=None)
in_shapes: [[1, 512, 512, 3]]
[INFO] tensorflow model: quantize_eval_model.pb
[INFO] parse raw model :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 532/532 [00:00<00:00, 13705.95it/s]
[INFO] infer shape (NHWC) :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 427/427 [00:00<00:00, 468.14it/s]
[INFO] perform level-0 opt :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 67.51it/s]
[INFO] perform level-1 opt :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 81.10it/s]
[INFO] generate xmodel :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 421/421 [00:00<00:00, 509.88it/s]
[INFO] dump xmodel: /tmp/dpu_yolov4_org.xmodel
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA0_B4096_MAX_BG2
[UNILOG][INFO] Compile mode: dpu
[UNILOG][INFO] Debug mode: function
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA0_B4096_MAX_BG2
[UNILOG][INFO] Graph name: quantize_eval_model, with op num: 861
[UNILOG][INFO] Begin to compile...
[UNILOG][WARNING] xir::Op{name = max_pooling2d_1/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(9) is not in DPU supported range [1, 2]].
[UNILOG][WARNING] xir::Op{name = max_pooling2d_2/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(13) is not in DPU supported range [1, 2]].
[UNILOG][INFO] Total device subgraph number 8, DPU subgraph number 2
[UNILOG][INFO] Compile done.
[UNILOG][INFO] The meta json is saved to "./yolov4_compiled/meta.json"
[UNILOG][INFO] The compiled xmodel is saved to "./yolov4_compiled//dpu_yolov4.xmodel"
[UNILOG][INFO] The compiled xmodel's md5sum is 731e22d09874d7dacf44452855d27831, and has been saved to "/yolov4_compiled/md5sum.txt"
```
DPUCADF8H is the only architecture I can work with.
The input size is correct.
```
#############################################
###### Parameters Assimilation: # 222
#############################################
0 data image_input Ops 0 Shape [4, 512, 512, 3] Frac 6 IN [] OUT ['image_input/aquant']
1 fix image_input/aquant Ops 0 Shape [4, 512, 512, 3] Frac 6 IN ['image_input'] OUT ['conv2d/Conv2D']
2 conv2d conv2d/Conv2D Ops 0 Shape [4, 512, 512, 32] Frac 6 IN ['image_input/aquant'] OUT ['leaky_re_lu/LeakyRelu']
3 leaky-relu leaky_re_lu/LeakyRelu Ops 0 Shape [4, 512, 512, 32] Frac 6 IN ['conv2d/Conv2D'] OUT ['leaky_re_lu/LeakyRelu/aquant']
4 fix leaky_re_lu/LeakyRelu/aquant Ops 0 Shape [4, 512, 512, 32] Frac 1 IN ['leaky_re_lu/LeakyRelu'] OUT ['zero_padding2d/Pad']
[UNILOG][WARNING] xir::Op{name = max_pooling2d_1/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(9) is not in DPU supported range [1, 2]].
[UNILOG][WARNING] xir::Op{name = max_pooling2d_2/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(13) is not in DPU supported range [1, 2]].
[UNILOG][INFO] Total device subgraph number 8, DPU subgraph number 2
```
The partition seems interesting: 2 DPU subgraphs. I cannot see the layers.
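The warnings concern yolov4's SPP block, whose large max-pool kernels (9 and 13) fall outside the DPU-supported range. One common workaround (a hedged sketch only, not necessarily what the yolov4 tutorial prescribes) is to cascade smaller stride-1 pools: because max is associative, two stride-1 5x5 max pools are exactly equivalent to one 9x9. A 1-D NumPy illustration:

```python
import numpy as np

def maxpool1d(x, k):
    # Stride-1 "same" max pool on a 1-D array; edges padded with -inf
    # so border windows just take the max of the in-range values.
    p = k // 2
    xp = np.pad(x, p, constant_values=-np.inf)
    return np.array([xp[i:i + k].max() for i in range(len(x))])

x = np.random.rand(32)
direct = maxpool1d(x, 9)                      # one 9-wide pool
cascade = maxpool1d(maxpool1d(x, 5), 5)       # two cascaded 5-wide pools
assert np.allclose(direct, cascade)           # identical results
```

The same identity holds in 2-D, which is why cascading small pools can keep such layers on the DPU instead of falling back to the CPU.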
@paolodalberto Have you had any luck in reproducing the problem with DPUCADF8H?
We can reproduce a problem with this network in 2.0 @Temerson0, but it is not the same as yours. My flow: TensorFlow -> xnnc -> xmodel -> compilation. The resize layers are removed, and thus everything following them (so I can create code).
With the usual flow TensorFlow -> partitioner -> xcompiler -> xmodel,
the resize layers are replaced with upsample (nearest-sample) and there is a shift/scaling. The scaling is not acceptable and the process fails ... let me see if I can enforce the scaling differently.
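The nearest-neighbour upsample that substitutes for the resize layers can be sketched in a few lines of NumPy. This is a minimal illustration assuming an integer 2x scale factor; the actual substitution (and the shift/scaling issue) happens inside the compiler:

```python
import numpy as np

def upsample_nearest_2x(x):
    # x: (H, W, C) feature map; repeating each row and column twice
    # reproduces nearest-neighbour resize for integer scale factors.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

x = np.arange(4, dtype=np.int8).reshape(2, 2, 1)
y = upsample_nearest_2x(x)
print(y.shape)  # -> (4, 4, 1)
```

Unlike a general resize, this operation involves no interpolation arithmetic, which is why it maps onto the accelerator more easily; the quantization shift/scaling around it is the part that failed here.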
OK, I can compile the partitioner xmodel:
```
Type Number Operations Duration
----------------------------------------------------------------------
LOADFM 3826 238077952 15781.399176955561
LOADW 524 73821024 4746.765432098778
LOADTH 0 0 0
SAVE 3152 180217856 14574.090534979923
CONV 1154 386681274368 81624.41237113455
ELEW 1312 32243712 3535.494736842082
POOL 0 0 0
----------------------------------------------------------------------
Total Time = 95098.29 us (FPS = 168.25)
```
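As a sanity check on the summary lines: the reported FPS in both performance tables is consistent with 16 frames per run divided by the total time. The figure of 16 frames is inferred from the numbers (e.g. batch 4 over several iterations) and is not stated anywhere in the thread, so treat it as an assumption:

```python
# Reported (total time in us, FPS) pairs from the two tables above.
runs = [(65321.44, 244.94), (95098.29, 168.25)]

FRAMES_PER_RUN = 16  # assumption inferred from the numbers, not confirmed

for total_us, reported_fps in runs:
    fps = FRAMES_PER_RUN / (total_us * 1e-6)  # frames / seconds
    assert round(fps, 2) == reported_fps
    print(f"{total_us} us -> {fps:.2f} FPS (reported {reported_fps})")
```

Both pairs reproduce exactly under this assumption, which suggests the tool's FPS line is simply frames-per-run over total time.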
better
Let me see how we can share the compiler, and then you may try to run it .... @bryanloz-xilinx ?
@paolodalberto and @bryanloz-xilinx Thanks for digging in and figuring the issue out. Looking forward to testing your fix.
I would like to know how the story ends.
most likely I broke the compiler somewhere else :)
@paolodalberto @bryanloz-xilinx I will test your fix as soon as you can share it with me.
I can reproduce the same error as you have @Temerson0, and the reason is fascinating. The operation schedule in the docker is different, and it exposes a memory allocation flaw. Working on a patch.
@paolodalberto Good to hear you reproduced the exact issue. I never would have thought that running the compilation under docker would produce different results than a native platform. Let me know when that patch is ready and I'll test it on my end.
@Temerson0 I sent you the compiler by EZMove; please untar it in your docker and see if you can reinstall the compiler:
```
(vitis-ai-tensorflow) Vitis-AI /workspace/dpuv3-pycompiler > pip install --user .
Processing /workspace/dpuv3-pycompiler
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: SC
  Building wheel for SC (setup.py) ... done
  Created wheel for SC: filename=SC-2.0-py3-none-any.whl size=212461 sha256=55b9099a7297947d402c1011ac9ecff167c46cd733d7de2bedef8547a0fc57d4
  Stored in directory: /home/vitis-ai-user/.cache/pip/wheels/78/a6/b2/4d1e03a53652e171493a40cf9b0d2ad28e2ef11342fb2b912a
Successfully built SC
Installing collected packages: SC
WARNING: The scripts v3int8_compiler, v3int8_partitioner and v3int8_xcompiler are installed in '/home/vitis-ai-user/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed SC-2.0
```
let me know if this hack will do anything for you
thank you
@paolodalberto I have good news and bad news. I was able to reinstall the compiler. The compile completed. I'm trying to run the model, but have run into a "CU Timeout" error. I'm currently trying to figure out if it's me causing the error. I may need to get the yolov4 tutorial's frozen .pb file. If one of us could test that model on your new compiler that may give us a working baseline. Thoughts?
At this time I cannot run anything on FPGA, for other reasons. Run time is not my forte, but let me ask ...
let me see if I can run this now ... where is the xmodel ?
the closest I can find is a caffe implementation 416x416 .. not really the same
I am attempting to compile a yolov4 model and it fails with the error in the title. Changes to the model were made according to the yolov4 tutorial. Quantization completed successfully. Compilation is being done for a U200 with DPUCADF8H, so the input_shape in the compile_yolov4.sh file was changed to 'input_shape':'4,512,512,3'.
The following script was run and generates the associated error message:
Note that the following allocation, which causes the error, returns `None`:
```python
tq = FM.allocate(temporary) ## the memory allocation should account for this extra space in FM
```
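A minimal sketch of this failure mode, assuming a simple bump allocator: `ToyBuffer` is entirely hypothetical and only mirrors the `FM.allocate` call and the `FeatureMapBuffer` size that appear in the compiler log, not the real allocator.

```python
class ToyBuffer:
    """Hypothetical stand-in for the feature-map buffer allocator."""

    def __init__(self, size):
        self.size = size   # total capacity in bytes
        self.used = 0      # bytes handed out so far

    def allocate(self, nbytes):
        # Return None when the request does not fit -- the caller must
        # handle this; the reported crash suggests one call site did not.
        if self.used + nbytes > self.size:
            return None
        offset = self.used
        self.used += nbytes
        return offset  # offset of the newly allocated region

FM = ToyBuffer(20971520)              # FeatureMapBuffer size from the log
ok = FM.allocate(4 * 64 * 64 * 128)   # one batch-4, 8-bit feature map: fits
too_big = FM.allocate(30_000_000)     # larger than the whole buffer: None
print(ok, too_big)  # -> 0 None
```

Under this model, an unchecked `None` from `allocate` would surface later as exactly the kind of assertion/trace failure described in the thread.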
I have tried this with vitis ai versions 1.4 and 2.0 with the same error. I'm not sure how to proceed, any help would be great.