Xilinx / Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
https://www.xilinx.com/ai
Apache License 2.0

AssertionError: the memory allocation MUST BE DONE BETTER RESHAPE #664

Open · Temerson0 opened 2 years ago

Temerson0 commented 2 years ago

I am attempting to compile a yolov4 model and it fails with the error in the title. Changes to the model were made according to the yolov4 tutorial. Quantization completed successfully. Compilation is being done for a U200 with DPUCADF8H, so the input_shape in the compile_yolov4.sh file was changed to 'input_shape':'4,512,512,3'.
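
As a quick sanity check on that options string (a hypothetical pre-check, not part of the Vitis AI tooling), one can confirm it parses as a Python dict literal and that `input_shape` splits into the expected four integers:

```python
# Hypothetical pre-check, not part of Vitis AI: verify the --options string is a
# well-formed Python dict literal and that input_shape has an N,H,W,C form.
import ast

options = ast.literal_eval("{'mode':'normal','save_kernel':'', 'input_shape':'4,512,512,3'}")
n, h, w, c = (int(x) for x in options['input_shape'].split(','))
assert (h, w, c) == (512, 512, 3), "unexpected input geometry"
print(options['mode'], (n, h, w, c))   # -> normal (4, 512, 512, 3)
```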

The following script was run and generates the associated error message:

TARGET=U200
NET_NAME=dpu_yolov4
ARCH=/opt/vitis_ai/compiler/arch/${DPU}/${TARGET}/arch.json

vai_c_tensorflow --frozen_pb ./eai_yolov4_quantized/quantize_eval_model.pb \
                 --arch ${ARCH} \
                 --output_dir ./yolov4_compiled/ \
                 --net_name ${NET_NAME} \
                 --options "{'mode':'normal','save_kernel':'', 'input_shape':'4,512,512,3'}"

/opt/vitis_ai/conda/envs/vitis-ai-tensorflow/lib/python3.6/site-packages/SC/HwAbstraction/code_convreshape.py(296)gen_fm_par_fm()
-> assert tq is not None, " the memory allocation MUST BE DONE BETTER RESHAPE" + "\n" + str(FM)
(Pdb) continue
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  AssertionError:  the memory allocation MUST BE DONE BETTER RESHAPE
Name FeatureMapBuffer Size 20971520 
   Banks [
    Name 0 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
    Name 1 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
    Name 2 Layout 2 Unit 8-bits Frequency 300MHz Rows 4096 Column 128 dict_keys([])
   ]

Note that the allocation that causes the error returns None: `tq = FM.allocate(temporary)  ## the memory allocation should account for this extra space in FM`. I have tried this with Vitis AI versions 1.4 and 2.0 and hit the same error in both. I'm not sure how to proceed; any help would be great.
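
For context, here is a small self-contained toy that mimics the reported failure mode (it is not the actual SC/HwAbstraction code; only the names `FM`/`allocate` and the bank geometry are taken from the dump above). Once the banks are exhausted, `allocate()` returns None, which is exactly the condition the assertion in `gen_fm_par_fm()` trips on:

```python
# Illustrative toy only -- not code_convreshape.py. It mimics the failure mode
# described above: FM.allocate(...) returns None when no bank can hold the
# temporary space a reshape needs, and the caller then asserts on that None.
class FeatureMapBuffer:
    def __init__(self, bank_rows, row_bytes=128):
        # free rows per bank, loosely modeled on the three banks (Rows 8192/8192/4096) above
        self.free_rows = list(bank_rows)
        self.row_bytes = row_bytes

    def allocate(self, size_bytes):
        rows_needed = -(-size_bytes // self.row_bytes)   # ceiling division
        for bank, free in enumerate(self.free_rows):
            if free >= rows_needed:
                self.free_rows[bank] -= rows_needed
                return (bank, rows_needed)               # successful allocation
        return None                                      # nothing fits


FM = FeatureMapBuffer([8192, 8192, 4096])
for rows in (8192, 8192, 4096):          # feature maps already resident fill all banks
    FM.allocate(rows * 128)
tq = FM.allocate(64 * 64 * 128)          # the extra space the reshape needs no longer fits
print(tq)                                # None -> this is where the real compiler asserts
```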

Temerson0 commented 2 years ago

Here is the output of the vai_c compile just before the error:


CODE add_10/add conv2d-eltwise
Namespace(absolutely=4, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=True, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution=True, depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape=True, forwardcut=None, framework='caffe', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline=True, maxpoolconcatissues=True, network=None, nocpu=True, operation_fusion=True, operation_fusion_elt=True, operation_fusion_pool_conv=True, optimalavgpool=False, output=None, parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params='param', prefetch=True, quant=None, skip=False, softwarepipeline=True, uppy_fusion=True)
BATCH IN  Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7] 
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8] 
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8] 
A
0
0
0

 the memory allocation MUST BE DONE BETTER RESHAPE
Name FeatureMapBuffer Size 20971520 
   Banks [
    Name 0 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
    Name 1 Layout 2 Unit 8-bits Frequency 300MHz Rows 8192 Column 128 dict_keys([])
    Name 2 Layout 2 Unit 8-bits Frequency 300MHz Rows 4096 Column 128 dict_keys([])
   ]
Name conv2d_38/Conv2D Type conv2d Composed [] 
    Pred ['conv2d_37/Conv2D'] 
    Succ ['conv2d_40/Conv2D', 'conv2d_39/Conv2D']
bryanloz-xilinx commented 2 years ago

@Temerson0 would you be willing to share the model artifacts? Frozen pb, quantized model.

@woinck could you please take a look at this with Paolo when you have time?

Temerson0 commented 2 years ago

@bryanloz-xilinx @woinck I can share the artifacts. What is the best way to send these to you? I've used EZMove in the past.

bryanloz-xilinx commented 2 years ago

> @bryanloz-xilinx @woinck I can share the artifacts. What is the best way to send these to you? I've used EZMove in the past.

EZMove still works, you can send to me, and I can make sure the right people get it.

bryanloz@xilinx.com

paolodalberto commented 2 years ago

OK, let me take a look at the model.

paolodalberto commented 2 years ago

This is a little awkward, because I can compile it in isolation using 2.0 (compiler alone):

Type      Number              Operations          Duration
----------------------------------------------------------------------
LOADFM    3602                221825024           14736.32921810776
LOADW     372                 48732768            3133.5802469135906
LOADTH    0                   0                   0
SAVE      2944                173801472           13872.460905350283
CONV      918                 256020316160        54219.216494845175
ELEW      1024                29097984            3190.5684210526074
POOL      0                   0                   0
----------------------------------------------------------------------
Total Time = 65321.44 us (FPS = 244.94)

paolodalberto commented 2 years ago

And I know this will raise more questions, but HW people create awesome plots:

(attached plot: yolov4)

This is to show that the computation is done and verified by HW. But let me see if I can take a look at the layer that is failing for you ...

paolodalberto commented 2 years ago
CODE add_9/add conv2d-eltwise
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN  Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7]
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
A
0
0
0
CODE add_10/add conv2d-eltwise
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN  Shape [4, 64, 64, 128] Heights [10, 8, 8, 8, 8, 8, 8, 7]
BATCH TMP Shape [4, 64, 64, 128] height [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
A
0
0
0
CODE conv2d_86/Conv2D conv2d
Namespace(absolutely=0, address=None, avgpool_as_convolution=True, backwardcut=None, batchynormy=False, biaspatch=False, bilinear_as_nearest_avgpool=True, caffemodel=None, changeofpadding=True, circus=False, deephideconvolution='True', depthwiseasfull=True, dmem=False, downsample_fusion=True, fc=False, final=False, firstlayerreshape='True', forwardcut=None, framework='xmodel', inner_as_convolution=True, inshapes='[4,224,224,3]', json=None, lcsoftwarepipeline='True', maxpoolconcatissues=False, network='examples/external/yolo4.xmodel', nocpu=True, operation_fusion='True', operation_fusion_elt='True', operation_fusion_pool_conv='True', optimalavgpool=False, output='work/out.asm', parallelismgraphalgorithm=None, parallelismstrategy="['bottom','top']", params=None, prefetch='True', quant=None, skip=False, softwarepipeline='True', subasadd=True, uppy_fusion='True')
BATCH IN  Shape [4, 64, 64, 256] Heights [8, 8, 8, 8, 8, 8, 8, 8]
BATCH OUT Shape [4, 64, 64, 128] Heights [8, 8, 8, 8, 8, 8, 8, 8]
0
0
representative 2 valid True
paolodalberto commented 2 years ago

The memory failure for you @Temerson0 should stop at a pdb trace (looking at the code and the assertion), but your memory seems "free".

paolodalberto commented 2 years ago

I wonder how I can reproduce the error ... it may be somewhere else (not the same layer)

bryanloz-xilinx commented 2 years ago

@Temerson0 what version of Vitis-AI did you use when you hit this error? It is possible that it was accidentally fixed when we released Vitis-AI 2.0.

Sorry, I now see the comment above where you tried it with both.

paolodalberto commented 2 years ago

There is still an opportunity for us to reproduce the error. Let me see what we can do.

Temerson0 commented 2 years ago

@bryanloz-xilinx @paolodalberto Yes, I did try both versions (as you saw). Did you follow the compile instructions that I used, with DPUCADF8H and U200? The input shape was also specified as 'input_shape':'4,512,512,3'. I did try a compile with DPUCZDX8G and ZCU102 with 'input_shape':'1,512,512,3', which did complete, but it didn't give me any output similar to the DPUCADF8H run. So I'm not sure why the output of the two compiles is different.

Temerson0 commented 2 years ago

@bryanloz-xilinx @paolodalberto Here is the output of the DPUCZDX8G compile. Notice that this is all of the output, whereas the DPUCADF8H compile had extensive output about each layer:

bash compile_yolov4.sh 
**************************************************
* VITIS_AI Compilation - Xilinx Inc.
**************************************************
[INFO] Namespace(batchsize=1, inputs_shape=['1,512,512,3'], layout='NHWC', model_files=['./eai_yolov4_quantized/quantize_eval_model.pb'], model_type='tensorflow', named_inputs_shape=None, out_filename='/tmp/dpu_yolov4_org.xmodel', proto=None)
in_shapes: [[1, 512, 512, 3]]
[INFO] tensorflow model: quantize_eval_model.pb
[INFO] parse raw model     :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 532/532 [00:00<00:00, 13705.95it/s]             
[INFO] infer shape (NHWC)  :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 427/427 [00:00<00:00, 468.14it/s]               
[INFO] perform level-0 opt :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 67.51it/s]                    
[INFO] perform level-1 opt :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 81.10it/s]                    
[INFO] generate xmodel     :100%|████████████████████████████████████████████████████████████████████████████████████████████████| 421/421 [00:00<00:00, 509.88it/s]               
[INFO] dump xmodel: /tmp/dpu_yolov4_org.xmodel
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA0_B4096_MAX_BG2
[UNILOG][INFO] Compile mode: dpu
[UNILOG][INFO] Debug mode: function
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA0_B4096_MAX_BG2
[UNILOG][INFO] Graph name: quantize_eval_model, with op num: 861
[UNILOG][INFO] Begin to compile...
[UNILOG][WARNING] xir::Op{name = max_pooling2d_1/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(9) is not in DPU supported range [1, 2]].
[UNILOG][WARNING] xir::Op{name = max_pooling2d_2/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(13) is not in DPU supported range [1, 2]].
[UNILOG][INFO] Total device subgraph number 8, DPU subgraph number 2
[UNILOG][INFO] Compile done.
[UNILOG][INFO] The meta json is saved to "./yolov4_compiled/meta.json"
[UNILOG][INFO] The compiled xmodel is saved to "./yolov4_compiled//dpu_yolov4.xmodel"
[UNILOG][INFO] The compiled xmodel's md5sum is 731e22d09874d7dacf44452855d27831, and has been saved to "/yolov4_compiled/md5sum.txt"
paolodalberto commented 2 years ago

DPUCADF8H is the only architecture I can work with.

The input size is correct:

 #############################################
 ######  Parameters Assimilation: # 222
 #############################################
  0 data       image_input Ops 0 Shape [4, 512, 512, 3] Frac 6 IN [] OUT ['image_input/aquant']
  1 fix        image_input/aquant Ops 0 Shape [4, 512, 512, 3] Frac 6 IN ['image_input'] OUT ['conv2d/Conv2D']
  2 conv2d     conv2d/Conv2D Ops 0 Shape [4, 512, 512, 32] Frac 6 IN ['image_input/aquant'] OUT ['leaky_re_lu/LeakyRelu']
  3 leaky-relu leaky_re_lu/LeakyRelu Ops 0 Shape [4, 512, 512, 32] Frac 6 IN ['conv2d/Conv2D'] OUT ['leaky_re_lu/LeakyRelu/aquant']
  4 fix        leaky_re_lu/LeakyRelu/aquant Ops 0 Shape [4, 512, 512, 32] Frac 1 IN ['leaky_re_lu/LeakyRelu'] OUT ['zero_padding2d/Pad']
paolodalberto commented 2 years ago
[UNILOG][WARNING] xir::Op{name = max_pooling2d_1/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(9) is not in DPU supported range [1, 2]].
[UNILOG][WARNING] xir::Op{name = max_pooling2d_2/MaxPool, type = pool-fix} has been assigned to CPU: ["kernel_height(13) is not in DPU supported range [1, 2]].
[UNILOG][INFO] Total device subgraph number 8, DPU subgraph number 2

The partition seems interesting: 2 DPU subgraphs, but I cannot see the layers.

Temerson0 commented 2 years ago

@paolodalberto Have you had any luck in reproducing the problem with DPUCADF8H?

paolodalberto commented 2 years ago

We can reproduce a problem with this network in 2.0, @Temerson0, but it is not the same as yours. My flow: TensorFlow -> xnnc -> xmodel -> compilation. The resize layers are removed, and thus everything following them as well (which is why I can create code).

With the usual flow TensorFlow -> partitioner -> xcompiler -> xmodel

the resize layers are replaced with upsample (nearest sample) and there is a shift/scaling. The scaling is not acceptable and the process fails ... let me see if I can enforce the scaling differently.

paolodalberto commented 2 years ago

OK, I can compile the partitioner xmodel

Type      Number              Operations          Duration            
----------------------------------------------------------------------
LOADFM    3826                238077952           15781.399176955561  
LOADW     524                 73821024            4746.765432098778   
LOADTH    0                   0                   0                   
SAVE      3152                180217856           14574.090534979923  
CONV      1154                386681274368        81624.41237113455   
ELEW      1312                32243712            3535.494736842082   
POOL      0                   0                   0                   
----------------------------------------------------------------------
Total Time = 95098.29 us (FPS = 168.25)
paolodalberto commented 2 years ago
(attached plot: yolov4)

better

paolodalberto commented 2 years ago

Let me see how we can share the compiler so you may try to run it .... @bryanloz-xilinx?

Temerson0 commented 2 years ago

@paolodalberto and @bryanloz-xilinx Thanks for digging in and figuring the issue out. Looking forward to testing your fix.

paolodalberto commented 2 years ago

I would like to know how the story ends.

paolodalberto commented 2 years ago

most likely I broke the compiler somewhere else :)

Temerson0 commented 2 years ago

@paolodalberto @bryanloz-xilinx I will test your fix as soon as you can share it with me.

paolodalberto commented 2 years ago

I can reproduce the same error as you have, @Temerson0, and the reason is fascinating. The operation schedule in the docker is different, and it exposes a memory allocation flaw. Working on a patch.

Temerson0 commented 2 years ago

@paolodalberto Good to hear you reproduced the exact issue. I never would have thought that running the compilation under docker would produce different results than running it on a native platform. Let me know when that patch is ready and I'll test it on my end.

paolodalberto commented 2 years ago

@Temerson0 I sent you the compiler by EZMove; please untar it in your docker and see if you can reinstall the compiler:

(vitis-ai-tensorflow) Vitis-AI /workspace/dpuv3-pycompiler > pip install --user .
Processing /workspace/dpuv3-pycompiler
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: SC
  Building wheel for SC (setup.py) ... done
  Created wheel for SC: filename=SC-2.0-py3-none-any.whl size=212461 sha256=55b9099a7297947d402c1011ac9ecff167c46cd733d7de2bedef8547a0fc57d4
  Stored in directory: /home/vitis-ai-user/.cache/pip/wheels/78/a6/b2/4d1e03a53652e171493a40cf9b0d2ad28e2ef11342fb2b912a
Successfully built SC
Installing collected packages: SC
  WARNING: The scripts v3int8_compiler, v3int8_partitioner and v3int8_xcompiler are installed in '/home/vitis-ai-user/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed SC-2.0
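
A minimal post-install sanity check (hypothetical, not part of the Vitis AI tooling) to confirm which SC package Python will now pick up inside the docker:

```python
# Hypothetical check, not part of Vitis AI: confirm the patched SC wheel is the
# one that will be imported, and where pip placed it.
import pkg_resources

dist = pkg_resources.get_distribution("SC")
print(dist.version, dist.location)   # expect 2.0 under ~/.local/lib/python3.6/site-packages
```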

let me know if this hack will do anything for you

thank you

Temerson0 commented 2 years ago

@paolodalberto I have good news and bad news. I was able to reinstall the compiler, and the compile completed. I'm trying to run the model, but have run into a "CU Timeout" error. I'm currently trying to figure out if it's me causing the error. I may need to get the yolov4 tutorial's frozen .pb file; if one of us could test that model on your new compiler, that may give us a working baseline. Thoughts?

paolodalberto commented 2 years ago

At this time I cannot run anything on the FPGA, for other reasons. Runtime is not my forte, but let me ask ...

paolodalberto commented 2 years ago

Let me see if I can run this now ... where is the xmodel?

paolodalberto commented 2 years ago

The closest I can find is a Caffe implementation at 416x416 ... not really the same.