GreenWaves-Technologies / gap_sdk

SDK for Greenwaves Technologies' GAP8 IoT Application Processor
https://greenwaves-technologies.com/en/gap8-the-internet-of-things-iot-application-processor/
Apache License 2.0

Batchnorm not directly after a convolution layer causes nntool error #226

Closed: neoamos closed this issue 3 years ago

neoamos commented 3 years ago

If you have a batchnorm directly after a convolution layer, the tflite converter folds it into the convolution (there is a sketch of this folding after the traceback below). If it's not directly after a convolution layer, the converter instead emits a multiply and an add operator, and this causes the following nntool error:

Traceback (most recent call last):
  File "/gap_sdk/tools/nntool/nntool", line 101, in <module>
    main()
  File "/gap_sdk/tools/nntool/nntool", line 85, in main
    mod.generate_code(args)
  File "/gap_sdk/tools/nntool/interpreter/generator.py", line 93, in generate_code
    write_template(G, code_gen, opts['model_directory'], opts['model_file'], code_template, "model")
  File "/gap_sdk/tools/nntool/interpreter/generator.py", line 47, in write_template
    model = template(G, code_generator=code_gen)
  File "/gap_sdk/tools/nntool/generation/default_template.py", line 193, in default_template
    return execute_template(generator_template_v3, G, naming_convension, code_generator)
  File "/gap_sdk/tools/nntool/generation/default_template.py", line 187, in execute_template
    return template_function(G, code_generator)
  File "/gap_sdk/tools/nntool/generation/default_template.py", line 94, in generator_template_v3
    ${gen.kernel_generator(indent=1)}
  File "/gap_sdk/tools/nntool/generation/code_generator.py", line 387, in kernel_generator
    if not self.execute_phase("kernels", node, qrec, in_eparams, out_eparams, cname):
  File "/gap_sdk/tools/nntool/generation/generators/generator_decorators.py", line 79, in execute_phase
    this_res = gen['func'](self, param, qrec, *args, **kwargs)
  File "/gap_sdk/tools/nntool/generation/generators/kernels/mult8/matadd_kernels_generator.py", line 43, in matadd_kernel_generator
    raise ValueError("missing generator: the matrix add generator only handles adds of tensors of the same size")
ValueError: missing generator: the matrix add generator only handles adds of tensors of the same size
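
For reference, here is a rough sketch of that folding (illustrative only, not the tflite converter's actual code). Batchnorm is a per-channel affine transform, so when it directly follows a convolution its scale and offset can be absorbed into the convolution's weights and bias:

import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-3):
    # W: (kh, kw, in_c, out_c) conv weights, b: (out_c,) bias -- Keras layout assumed
    scale = gamma / np.sqrt(var + eps)   # per-output-channel multiplier
    offset = beta - mean * scale         # per-output-channel offset
    W_folded = W * scale                 # scales each output channel's filters
    b_folded = b * scale + offset
    # conv(x, W_folded) + b_folded == batchnorm(conv(x, W) + b)
    return W_folded, b_folded

When the batchnorm does not directly follow a convolution, as with the second BatchNormalization in the model below, there is nothing to fold it into, so it stays behind as a separate per-channel multiply and add.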

The definition of the network:

# Keras 2.x style imports, added here for completeness
import keras
from keras.layers import Input, Conv2D, MaxPooling2D, Activation, add, Flatten, Dense
from keras.models import Model

def batch_norm(img_width, img_height, img_channels, output_dim):
    # Input
    img_input = Input(shape=(img_height, img_width, img_channels))

    x1 = Conv2D(32, (5, 5), strides=[2, 2], padding='same')(img_input)
    x1 = MaxPooling2D(pool_size=(3, 3), strides=[2, 2])(x1)

    # First residual block
    x2 = Conv2D(32, (3, 3), strides=[2, 2], padding='same')(x1)

    x2 = keras.layers.normalization.BatchNormalization()(x2)
    x2 = Activation('relu')(x2)
    x2 = Conv2D(32, (3, 3), padding='same')(x2)

    x1 = Conv2D(32, (1, 1), strides=[2, 2], padding='same')(x1)
    x3 = add([x1, x2])

    # Batch norm not directly after conv causes error
    x3 = keras.layers.normalization.BatchNormalization()(x3)

    x = Flatten()(x3)

    x = Dense(output_dim)(x)
    nums = Activation('softmax')(x)

    model = Model(inputs=[img_input], outputs=[nums])
    print(model.summary())

    return model

And a visualization of the tflite network: [image: batch norm network]

This is with gap_sdk 3.9.1. If the input to the batchnorm is larger, tflite converts it to just a multiply operation with no add, and nntool doesn't throw an error. I don't know exactly what algorithm tflite is using, but it would be nice if the add were supported so batchnorm works in all cases.

sousoux commented 3 years ago

Try running fusions -a expression_matcher. Be aware that this feature is still in development and will be updated in the next nntool version; there are issues with it in the currently released version.

neoamos commented 3 years ago

I'm not sure exactly what you mean by that. Is it an option for the tflite converter or nntool? I notice that the add operator is actually only present if the model is trained and not present if it's untrained. I previously said it was not there with larger input, but that's because I wasn't training the model when the input was large. Maybe the converter optimizes it out if it is all zeros. Here is the tflite file as well: model.tflite.gz
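
That guess lines up with the batchnorm arithmetic: a standalone batchnorm lowers to a per-channel multiply plus add, and the add term is all zeros for an untrained model. A minimal sketch (parameter names are mine, not taken from the tflite file):

import numpy as np

def bn_as_mul_add(x, gamma, beta, mean, var, eps=1e-3):
    scale = gamma / np.sqrt(var + eps)
    offset = beta - mean * scale
    return x * scale + offset            # the MUL and ADD the converter emits

With a freshly initialized (untrained) model, beta and the moving mean are both zero, so offset is all zeros and the converter only needs the multiply, which would explain why the add disappears.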

sousoux commented 3 years ago

If you have modified one of our sample projects you will find the nntool script in the model directory. Execute all the commands manually after having opened the graph, and add a fusions -a expression_matcher before saving the state or generating. The add/mul/add chain should then be fused into a single operation that has a kernel compiled for it.
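
For example, the relevant fragment might look something like this (the open/quantize/save commands are placeholders here; keep whatever your existing script already does for those steps):

open model.tflite
fusions --scale8
fusions -a expression_matcher
... existing quantization / save_state / generate commands unchanged ...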

neoamos commented 3 years ago

Oh, I understand now. I'm using an example from the AI deck example repo and it has an nntool script with 'fusions --scale8'. What is the expression_matcher supposed to be? I can't find any documentation about it. I tried 'fusions -a scale8_match_group' and it outputs the same result as before.

sousoux commented 3 years ago

If you open nntool and type help or help fusions you will get an explanation of this. fusions -l lists all available fusions. fusions -a expression_matcher attempts to fuse piecewise operations into a single kernel. It should be run after the existing script, before quantization, if you are not importing a quantized graph.

neoamos commented 3 years ago

When you said to do 'fusions -a expression_matcher' I thought expression_matcher was just a placeholder, but I notice now it's an operation you can run. When I did that, it fused the add and multiply into one operation. It produces this graph:

+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
| Step |         Step name         |        Operation        | Input Dims | Output Dims | Inputs | Active | Params |   Ops   |          Params          |          Hints           |
|      |                           |                         |  (hxwxc)   |   (hxwxc)   |        |  size  |  size  |         |                          |                          |
+======+===========================+=========================+============+=============+========+========+========+=========+==========================+==========================+
|  0   | input_1                   | input                   |  28x28x1   |   28x28x1   |        |  784   |   0    |         | I 28x28x1  FIXED_ORDER=0 | in: hxwxc out: none      |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  1   | input_1_formatter         | image_format            |  28x28x1   |   28x28x1   |  0/0   |  1568  |   0    |         | FORMAT_CHANGE Fmt: BW8   | in: none out: none       |
|      |                           |                         |            |             |        |        |        |         | Norm: OFFSET_INT8        |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  2   | DEPTHWISE_CONV_2D_0_0_r_c | reshape                 |  28x28x1   |   1x28x28   |  1/0   |  1568  |   0    |         | SHAPE 1x28x28            | in: none out: none       |
|      | hw                        |                         |            |             |        |        |        |         |                          |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  5   | DEPTHWISE_CONV_2D_0_0_fus | conv_fusion_conv_pool   |  1x28x28   |   32x6x6    |  2/0   |  2768  |  832   | 167.17K | F 32x1x5x5 S 2x2 D 1x1 G | in: hxwxc,out_cxin_cxhxw |
|      | ion                       |                         |  32x1x5x5  |             |  3/0   |        |        |         | 1 M 1 P 1x2x1x2 zero, T  | ,out_c out: cxhxw        |
|      |                           |                         |     32     |             |  4/0   |        |        |         | max F 3x3 S 2x2 P        |                          |
|      |                           |                         |            |             |        |        |        |         | 0x0x0x0 zero             |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  8   | CONV_2D_0_2_fusion        | conv_fusion_conv_active |   32x6x6   |   32x3x3    |  5/0   | 10688  |  9248  |  82.94K | F 32x32x3x3 S 2x2 D 1x1  | in: cxhxw,out_cxin_cxhxw |
|      |                           |                         | 32x32x3x3  |             |  6/0   |        |        |         | G 1 M 1 P 0x1x0x1 zero,  | ,out_c out: cxhxw        |
|      |                           |                         |     32     |             |  7/0   |        |        |         | Activation relu          |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  11  | CONV_2D_0_3               | conv2d                  |   32x6x6   |   32x3x3    |  5/0   |  2784  |  1056  |   9.22K | F 32x32x1x1 S 2x2 D 1x1  | in: cxhxw,out_cxin_cxhxw |
|      |                           |                         | 32x32x1x1  |             |  9/0   |        |        |         | G 1 M 1 P 0x0x0x0 zero   | ,out_c out: cxhxw        |
|      |                           |                         |     32     |             |  10/0  |        |        |         |                          |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  18  | CONV_2D_0_4               | conv2d                  |   32x3x3   |   32x3x3    |  8/0   | 13066  |  9248  |  82.94K | F 32x32x3x3 S 1x1 D 1x1  | in: cxhxw,out_cxin_cxhxw |
|      |                           |                         | 32x32x3x3  |             |  12/0  |        |        |         | G 1 M 1 P 1x1x1x1 zero   | ,out_c out: cxhxw        |
|      |                           |                         |     32     |             |  17/0  |        |        |         |                          |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  19  | expr_0                    | expression              |   32x1x1   |   32x3x3    |  14/0  |  3818  |   0    |         | add: 2, mul: 1           | in: none out: none       |
|      |                           |                         |   32x1x1   |             |  13/0  |        |        |         |                          |                          |
|      |                           |                         |   32x3x3   |             |  11/0  |        |        |         |                          |                          |
|      |                           |                         |   32x3x3   |             |  18/0  |        |        |         |                          |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  20  | FULLY_CONNECTED_0_8       | linear                  |   32x3x3   |     10      |  19/0  |  3188  |  2890  |   2.88K | F 10x32x3x3              | in:                      |
|      |                           |                         |   10x288   |             |  15/0  |        |        |         |                          | cx0x1,out_cxin_c,out_c   |
|      |                           |                         |     10     |             |  16/0  |        |        |         |                          | out: c                   |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  21  | SOFTMAX_0_9               | softmax                 |     10     |     10      |  20/0  |   20   |   0    |      20 | Beta 0.0 Axis 0          | in: none out: none       |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|  22  | output_1                  | output                  |     10     |     10      |  21/0  |   10   |   0    |         | O 10  FIXED_ORDER=0      | in: none out: none       |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|      | Totals (#)                |                         |            |             |        | 13066  | 46612  | 345.17K |                          |                          |
|      | Max active/Total params   |                         |            |             |        |        |        |         |                          |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+
|      | Totals (#)                |                         |            |             |        |        | 59678  | 345.17K |                          |                          |
|      | Max mem usage             |                         |            |             |        |        |        |         |                          |                          |
+------+---------------------------+-------------------------+------------+-------------+--------+--------+--------+---------+--------------------------+--------------------------+

However, it then throws an error when compiling the autotiler model:

BUILD_MODEL_SQ8BIT/modelModel.c:4:25: warning: implicit declaration of function 'gap_ncore' [-Wimplicit-function-declaration]
 static int ActiveCore = gap_ncore();
                         ^~~~~~~~~
BUILD_MODEL_SQ8BIT/modelModel.c:4:25: error: initializer element is not constant
BUILD_MODEL_SQ8BIT/modelModel.c: In function 'ChunkSize':
BUILD_MODEL_SQ8BIT/modelModel.c:14:13: warning: implicit declaration of function 'gap_fl1' [-Wimplicit-function-declaration]
  Log2Core = gap_fl1(NCore);
             ^~~~~~~
BUILD_MODEL_SQ8BIT/modelModel.c: At top level:
BUILD_MODEL_SQ8BIT/modelModel.c:25:17: error: unknown type name 's19_kernel_args_t'
 void s19_kernel(s19_kernel_args_t *Args) {
                 ^~~~~~~~~~~~~~~~~
model_rules.mk:78: recipe for target 'BUILD_MODEL_SQ8BIT/GenTile' failed

The autotiler model doesn't seem to have been generated correctly:

#include "modelModel.c"

static int CoreCountDynamic = 1;
static int ActiveCore = gap_ncore();

static inline unsigned int __attribute__((always_inline)) ChunkSize(unsigned int X)

{
    unsigned int NCore;
    unsigned int Log2Core;
    unsigned int Chunk;

    if (CoreCountDynamic) NCore = ActiveCore; else NCore = gap_ncore();
    Log2Core = gap_fl1(NCore);
    Chunk = (X>>Log2Core) + ((X&(NCore-1))!=0);
    return Chunk;
}

#ifndef AT_NORM
#define AT_NORM(x, n)   gap_roundnorm_reg((x), (n))
#endif
#define ATLShift(x, n)  ((x) << (n))

// Output iteration space reduced to 2 iteration spaces
void s19_kernel(s19_kernel_args_t *Args) {
    unsigned int H = Args->H;
    unsigned int W = Args->W;
    signed char * expr_0_in_3 = Args->expr_0_in_3;
    signed char * expr_0_in_2 = Args->expr_0_in_2;
    signed char * expr_0_in_0 = Args->expr_0_in_0;
    signed char * expr_0_in_1 = Args->expr_0_in_1;
    signed char * expr_0_out_0 = Args->expr_0_out_0;
    unsigned int CoreId = gap_coreid();
    unsigned int Chunk = ChunkSize(H);
    unsigned int First = Chunk*CoreId;
    unsigned int Last = gap_min(First+Chunk, H);
    for (int d0=First; d0<Last; d0++) {
        for (int d1_d2=0; d1_d2<W; d1_d2++) {
            expr_0_out_0[d0*9+d1_d2*1] = ((signed char)gap_clip((gap_roundnorm_reg(((gap_roundnorm_reg((gap_roundnorm_reg((gap_roundnorm_reg(((gap_roundnorm_reg((((int)expr_0_in_1[d0*9+d1_d2*1])*25642), 16)+((int)expr_0_in_2[d0*9+d1_d2*1]))*31768), 7)*((int)expr_0_in_3[d0*1])), 7)*27142), 22)+((int)expr_0_in_0[d0*1]))*32079), 16)), (7)));
        }
    }

    gap_waitbarrier(0);
}
gemenerik commented 2 years ago

I get undefined references to all occurrences of the mul-add fusion kernels generated with the expression_matcher:

[...]/BUILD/GAP8_V2/GCC_RISCV_PULPOS/BUILD_MODEL_SQ8BIT/modelKernels.o: In function `hal_spr_read_then_clr':
/home/rik/gap_sdk_490/install/GAP8_V2/include/hal/dma/mchan_v6.h:272: undefined reference to `s241_kernel'
sousoux commented 2 years ago

Add MODEL_EXPRESSIONS = $(MODEL_BUILD)/Expression_Kernels.c to common.mk. It's done automatically by gen_project. We need to make that more automatic.

gemenerik commented 2 years ago

Thank you.

I've been working with the SDK for quite a while now and have a lot of legacy code in my project. I decided to go for a clean start with gen_project and that seems to have fixed it for me. Nice feature!

aqqz commented 2 years ago

I'm hitting what seems to be the same problem. I added fusions -a expression_matcher to the nntool script and added MODEL_EXPRESSIONS = $(MODEL_BUILD)/Expression_Kernels.c in model_decl.mk. My SDK version is 4.8.0. Here is the error info:

/home/taozhi/tf2/BUILD/GAP8_V2/GCC_RISCV_PULPOS/BUILD_MODEL_SQ8BIT/modelKernels.o: In function `eu_bar_setup_mask':
/home/taozhi/gap/gap_sdk/install/GAP8_V2/include/pmsis/implem/dma.h:268: undefined reference to `s2_kernel'
/home/taozhi/gap/gap_sdk/install/GAP8_V2/include/pmsis/implem/dma.h:268: undefined reference to `s2_kernel'
/home/taozhi/tf2/BUILD/GAP8_V2/GCC_RISCV_PULPOS/BUILD_MODEL_SQ8BIT/modelKernels.o: In function `rt_team_fork':
/home/taozhi/gap/gap_sdk/install/GAP8_V2/include/pmsis/implem/dma.h:268: undefined reference to `s2_kernel'
collect2: error: ld returned 1 exit status
make: *** [/home/taozhi/gap/gap_sdk/utils/rules/pulp_rules.mk:227:/home/taozhi/tf2/BUILD/GAP8_V2/GCC_RISCV_PULPOS/application] error 1

Jool99 commented 7 months ago

If anyone gets the same issue as aqqz above: I solved a similar error by adding MODEL_EXPRESSIONS = $(MODEL_BUILD)/Expression_Kernels.c in model_decl.mk,

and then making sure MODEL_EXPRESSIONS is used in these two lines, which were already there:

MODEL_GEN_C = $(addsuffix .c, $(MODEL_GEN)) $(MODEL_EXPRESSIONS)
MODEL_GEN_CLEAN = $(MODEL_GEN_C) $(addsuffix .h, $(MODEL_GEN)) $(MODEL_EXPRESSIONS)