Xilinx / Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
https://www.xilinx.com/ai
Apache License 2.0

AssertionError: the memory allocation MUST BE DONE BETTER RESHAPE #664

Open · Temerson0 opened 2 years ago

Temerson0 commented 2 years ago

I am attempting to compile a yolov4 model and it fails with the error in the title. Changes to the model were made according to the yolov4 tutorial. Quantization completed successfully. Compilation is being done for a U200 with DPUCADF8H, so the input_shape in the compile_yolov4.sh file was changed to 'input_shape':'4,512,512,3'.
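
As a quick sanity check on that options string (a hypothetical pre-check, not part of the Vitis AI tooling), one can confirm it parses as a Python dict literal and that `input_shape` splits into the expected four integers:

```python
# Hypothetical pre-check, not part of Vitis AI: verify the --options string is a
# well-formed Python dict literal and that input_shape has an N,H,W,C form.
import ast

options = ast.literal_eval("{'mode':'normal','save_kernel':'', 'input_shape':'4,512,512,3'}")
n, h, w, c = (int(x) for x in options['input_shape'].split(','))
assert (h, w, c) == (512, 512, 3), "unexpected input geometry"
print(options['mode'], (n, h, w, c))   # -> normal (4, 512, 512, 3)
```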

The following script was run and generates the associated error message:

TARGET=U200
NET_NAME=dpu_yolov4
ARCH=/opt/vitis_ai/compiler/arch/${DPU}/${TARGET}/arch.json

vai_c_tensorflow --frozen_pb ./eai_yolov4_quantized/quantize_eval_model.pb \
                 --arch ${ARCH} \
                 --output_dir ./yolov4_compiled/ \
                 --net_name ${NET_NAME} \
                 --options "{'mode':'normal','save_kernel':'', 'input_shape':'4,512,512,3'}"

/opt/vitis_ai/conda/envs/vitis-ai-tensorflow/lib/python3.6/site-packages/SC/HwAbstraction/code_convreshape.py(296)gen_fm_par_fm()
-> assert tq is not None, " the memory allocation MUST BE DONE BETTER RESHAPE" + "\n" + str(FM)
(Pdb) continue
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  AssertionError:  the memory allocation MUST BE DONE BETTER RESHAPE
Name FeatureMapBuffer Size 20971520 
   Banks [
    Name 0 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
    Name 1 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
    Name 2 Layout 2 Unit 8-bits Frequency 300MHz Rows 4096 Column 128 dict_keys([])
   ]

Note that the allocation that causes the error returns None: `tq = FM.allocate(temporary)  ## the memory allocation should account for this extra space in FM`. I have tried this with Vitis AI versions 1.4 and 2.0 and hit the same error in both. I'm not sure how to proceed; any help would be great.
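
For context, here is a small self-contained toy that mimics the reported failure mode (it is not the actual SC/HwAbstraction code; only the names `FM`/`allocate` and the bank geometry are taken from the dump above). Once the banks are exhausted, `allocate()` returns None, which is exactly the condition the assertion in `gen_fm_par_fm()` trips on:

```python
# Illustrative toy only -- not code_convreshape.py. It mimics the failure mode
# described above: FM.allocate(...) returns None when no bank can hold the
# temporary space a reshape needs, and the caller then asserts on that None.
class FeatureMapBuffer:
    def __init__(self, bank_rows, row_bytes=128):
        # free rows per bank, loosely modeled on the three banks (Rows 8192/8192/4096) above
        self.free_rows = list(bank_rows)
        self.row_bytes = row_bytes

    def allocate(self, size_bytes):
        rows_needed = -(-size_bytes // self.row_bytes)   # ceiling division
        for bank, free in enumerate(self.free_rows):
            if free >= rows_needed:
                self.free_rows[bank] -= rows_needed
                return (bank, rows_needed)               # successful allocation
        return None                                      # nothing fits


FM = FeatureMapBuffer([8192, 8192, 4096])
for rows in (8192, 8192, 4096):          # feature maps already resident fill all banks
    FM.allocate(rows * 128)
tq = FM.allocate(64 * 64 * 128)          # the extra space the reshape needs no longer fits
print(tq)                                # None -> this is where the real compiler asserts
```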

Temerson0 commented 2 years ago

Here is the output of the vai_c compile just before the error:


CODE add_10/add conv2d-eltwise
Namespace(absolutely=4, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=True, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution=True, depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape=True, forwardcut=None, framework='caffe', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline=True, maxpoolconcatissues=True, network=None, nocpu=True, operation_fusion=True, operation_fusion_elt=True, operation_fusion_pool_conv=True, optimalavgpool=False, output=None, parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params='param', prefetch=True, quant=None, skip=False, softwarepipeline=True, uppy_fusion=True)
BATCH IN  Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7] 
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8] 
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8] 
A
0
0
0

 the memory allocation MUST BE DONE BETTER RESHAPE
Name FeatureMapBuffer Size 20971520 
   Banks [
    Name 0 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
    Name 1 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
    Name 2 Layout 2 Unit 8-bits Frequency 300MHz Rows 4096 Column 128 dict_keys([])
   ]
Name conv2d_38/Conv2D Type conv2d Composed [] 
    Pred ['conv2d_37/Conv2D'] 
    Succ ['conv2d_40/Conv2D', 'conv2d_39/Conv2D']
bryanloz-xilinx commented 2 years ago

@Temerson0 would you be willing to share the model artifacts? Frozen pb, quantized model.

@woinck could you please take a look at this with Paolo when you have time?

Temerson0 commented 2 years ago

@bryanloz-xilinx @woinck I can share the artifacts. What is the best way to send these to you? I've used EZMove in the past.

bryanloz-xilinx commented 2 years ago

> @bryanloz-xilinx @woinck I can share the artifacts. What is the best way to send these to you? I've used EZMove in the past.

EZMove still works, you can send to me, and I can make sure the right people get it.

bryanloz@xilinx.com

paolodalberto commented 2 years ago

OK, let me take a look at the model.

paolodalberto commented 2 years ago

This is a little awkward, because I can compile it in isolation using 2.0 (compiler alone):

Type      Number              Operations          Duration
----------------------------------------------------------------------
LOADFM    3602                221825024           14736.32921810776
LOADW     372                 48732768            3133.5802469135906
LOADTH    0                   0                   0
SAVE      2944                173801472           13872.460905350283
CONV      918                 256020316160        54219.216494845175
ELEW      1024                29097984            3190.5684210526074
POOL      0                   0                   0
----------------------------------------------------------------------
Total Time = 65321.44 us (FPS = 244.94)

paolodalberto commented 2 years ago

And I know this will raise more questions, but HW people create awesome plots:

(attached plot: yolov4)

This is to show that the computation is done and verified by HW. But let me see if I can take a look at the layer that is failing for you ...

paolodalberto commented 2 years ago
CODE add_9/add conv2d-eltwise
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN  Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7]
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
A
0
0
0
CODE add_10/add conv2d-eltwise
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN  Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7]
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
A
0
0
0
CODE conv2d_86/Conv2D conv2d
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN  Shape [4, 64, 64, 256] Heights [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
0
0
representative 2 valid True
paolodalberto commented 2 years ago

The memory failure for you @Temerson0 should stop at a pdb trace (looking at the code and the assertion), but your memory seems "free".

paolodalberto commented 2 years ago

I wonder how I can reproduce the error ... it may be somewhere else (not the same layer)

bryanloz-xilinx commented 2 years ago

@Temerson0 what version of Vitis-AI did you use when you hit this error? It is possible that it was accidentally fixed when we released Vitis-AI 2.0.

Sorry, I now see the comment above where you tried it with both.

paolodalberto commented 2 years ago

There is still an opportunity for us to reproduce the error. Let me see what we can do.

Temerson0 commented 2 years ago

@bryanloz-xilinx @paolodalberto Yes, I did try both versions (as you saw). Did you follow the compile instructions that I used, with DPUCADF8H and U200? The input shape was also specified as 'input_shape':'4,512,512,3'. I did try a compile with DPUCZDX8G and ZCU102 with 'input_shape':'1,512,512,3', which did complete, but it didn't give me any output similar to the DPUCADF8H run. So I'm not sure why the output of the two compiles is different.

Temerson0 commented 2 years ago

@bryanloz-xilinx @paolodalberto Here is the output of the DPUCZDX8G compile. Notice that this is all of the output, whereas the DPUCADF8H compile had extensive output about each layer:

bash compile_yolov4.sh 
**************************************************
* VITIS_AI Compilation - Xilinx Inc.
**************************************************
[INFO] Namespace(batchsize=1, inputs_shape=['1,512,512,3'], layout='NHWC', model_files=['./eai_yolov4_quantized/quantize_eval_model.pb'], model_type='tensorflow', named_inputs_shape=None, out_filename='/tmp/dpu_yolov4_org.xmodel', proto=None)
in_shapes: [[1, 512, 512, 3]]
[INFO] tensorflow model: quantize_eval_model.pb
[INFO] parse raw model     :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 532/532 [00:00<00:00, 13705.95it/s]             
[INFO] infer shape (NHWC)  :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 427/427 [00:00<00:00, 468.14it/s]               
[INFO] perform level-0 opt :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 67.51it/s]                    
[INFO] perform level-1 opt :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 81.10it/s]                    
[INFO] generate xmodel     :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 421/421 [00:00<00:00, 509.88it/s]               
[INFO] dump xmodel: /tmp/dpu_yolov4_org.xmodel
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA0_B4096_MAX_BG2
[UNILOG][INFO] Compile mode: dpu
[UNILOG][INFO] Debug mode: function
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA0_B4096_MAX_BG2
[UNILOG][INFO] Graph name: quantize_eval_model, with op num: 861
[UNILOG][INFO] Begin to compile...
[UNILOG][WARNING] xir::Op{name = max_pooling2d_1/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(9) is not in DPU supported range [1, 2]].
[UNILOG][WARNING] xir::Op{name = max_pooling2d_2/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(13) is not in DPU supported range [1, 2]].
[UNILOG][INFO] Total device subgraph number 8, DPU subgraph number 2
[UNILOG][INFO] Compile done.
[UNILOG][INFO] The meta json is saved to "./yolov4_compiled/meta.json"
[UNILOG][INFO] The compiled xmodel is saved to "./yolov4_compiled//dpu_yolov4.xmodel"
[UNILOG][INFO] The compiled xmodel's md5sum is 731e22d09874d7dacf44452855d27831, and has been saved to "/yolov4_compiled/md5sum.txt"
paolodalberto commented 2 years ago

DPUCADF8H is the only architecture I can work with.

The input size is correct:

 #############################################
 ######  Parameters Assimilation: # 222
 #############################################
  0 data       image_input Ops 0 Shape [4, 512, 512, 3] Frac 6 IN [] OUT ['image_input/aquant']
  1 fix        image_input/aquant Ops 0 Shape [4, 512, 512, 3] Frac 6 IN ['image_input'] OUT ['conv2d/Conv2D']
  2 conv2d     conv2d/Conv2D Ops 0 Shape [4, 512, 512, 32] Frac 6 IN ['image_input/aquant'] OUT ['leaky_re_lu/LeakyRelu']
  3 leaky-relu leaky_re_lu/LeakyRelu Ops 0 Shape [4, 512, 512, 32] Frac 6 IN ['conv2d/Conv2D'] OUT ['leaky_re_lu/LeakyRelu/aquant']
  4 fix        leaky_re_lu/LeakyRelu/aquant Ops 0 Shape [4, 512, 512, 32] Frac 1 IN ['leaky_re_lu/LeakyRelu'] OUT ['zero_padding2d/Pad']
paolodalberto commented 2 years ago
[UNILOG][WARNING] xir::Op{name = max_pooling2d_1/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(9) is not in DPU supported range [1, 2]].
[UNILOG][WARNING] xir::Op{name = max_pooling2d_2/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(13) is not in DPU supported range [1, 2]].
[UNILOG][INFO] Total device subgraph number 8, DPU subgraph number 2

The partition seems interesting: 2 DPU subgraphs, but I cannot see the layers.

Temerson0 commented 2 years ago

@paolodalberto Have you had any luck in reproducing the problem with DPUCADF8H?

paolodalberto commented 2 years ago

We can reproduce a problem with this network in 2.0, @Temerson0, but it is not the same as yours. My flow: TensorFlow -> xnnc -> xmodel -> compilation. The resize layers are removed, and thus everything following them as well (which is why I can create code).

With the usual flow TensorFlow -> partitioner -> xcompiler -> xmodel

the resize layers are replaced with upsample (nearest sample) and there is a shift/scaling. The scaling is not acceptable and the process fails ... let me see if I can enforce the scaling differently.

paolodalberto commented 2 years ago

OK, I can compile the partitioner xmodel

Type      Number              Operations          Duration            
----------------------------------------------------------------------
LOADFM    3826                238077952           15781.399176955561  
LOADW     524                 73821024            4746.765432098778   
LOADTH    0                   0                   0                   
SAVE      3152                180217856           14574.090534979923  
CONV      1154                386681274368        81624.41237113455   
ELEW      1312                32243712            3535.494736842082   
POOL      0                   0                   0                   
----------------------------------------------------------------------
Total Time = 95098.29 us (FPS = 168.25)
paolodalberto commented 2 years ago
(attached plot: yolov4)

better

paolodalberto commented 2 years ago

Let me see how we can share the compiler so you may try to run it .... @bryanloz-xilinx?

Temerson0 commented 2 years ago

@paolodalberto and @bryanloz-xilinx Thanks for digging in and figuring the issue out. Looking forward to testing your fix.

paolodalberto commented 2 years ago

I would like to know how the story ends.

paolodalberto commented 2 years ago

most likely I broke the compiler somewhere else :)

Temerson0 commented 2 years ago

@paolodalberto @bryanloz-xilinx I will test your fix as soon as you can share it with me.

paolodalberto commented 2 years ago

I can reproduce the same error as you have, @Temerson0, and the reason is fascinating. The operation schedule in the docker is different, and it exposes a memory allocation flaw. Working on a patch.

Temerson0 commented 2 years ago

@paolodalberto Good to hear you reproduced the exact issue. I never would have thought that running the compilation under docker would produce different results than running it on a native platform. Let me know when that patch is ready and I'll test it on my end.

paolodalberto commented 2 years ago

@Temerson0 I sent you the compiler by EZMove; please untar it in your docker and see if you can reinstall the compiler:

(vitis-ai-tensorflow) Vitis-AI /workspace/dpuv3-pycompiler > pip install --user .
Processing /workspace/dpuv3-pycompiler
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: SC
  Building wheel for SC (setup.py) ... done
  Created wheel for SC: filename=SC-2.0-py3-none-any.whl size=212461 sha256=55b9099a7297947d402c1011ac9ecff167c46cd733d7de2bedef8547a0fc57d4
  Stored in directory: /home/vitis-ai-user/.cache/pip/wheels/78/a6/b2/4d1e03a53652e171493a40cf9b0d2ad28e2ef11342fb2b912a
Successfully built SC
Installing collected packages: SC
  WARNING: The scripts v3int8_compiler, v3int8_partitioner and v3int8_xcompiler are installed in '/home/vitis-ai-user/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed SC-2.0
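
A minimal post-install sanity check (hypothetical, not part of the Vitis AI tooling) to confirm which SC package Python will now pick up inside the docker:

```python
# Hypothetical check, not part of Vitis AI: confirm the patched SC wheel is the
# one that will be imported, and where pip placed it.
import pkg_resources

dist = pkg_resources.get_distribution("SC")
print(dist.version, dist.location)   # expect 2.0 under ~/.local/lib/python3.6/site-packages
```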

let me know if this hack will do anything for you

thank you

Temerson0 commented 2 years ago

@paolodalberto I have good news and bad news. I was able to reinstall the compiler, and the compile completed. I'm trying to run the model, but have run into a "CU Timeout" error. I'm currently trying to figure out if it's me causing the error. I may need to get the yolov4 tutorial's frozen .pb file; if one of us could test that model on your new compiler, that may give us a working baseline. Thoughts?

paolodalberto commented 2 years ago

At this time I cannot run anything on the FPGA, for other reasons. Runtime is not my forte, but let me ask ...

paolodalberto commented 2 years ago

Let me see if I can run this now ... where is the xmodel?

paolodalberto commented 2 years ago

The closest I can find is a Caffe implementation at 416x416 ... not really the same.