analogdevicesinc / ai8x-synthesis

Quantization and Synthesis (Device Specific Code Generation) for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

Confusion about memory overwrite #316

Closed nikky4D closed 9 months ago

nikky4D commented 11 months ago

Hi, I am working on the FPN detector example. If I modify my FPN so that it uses the 64x80 output and drops the 4x5 output, what is the best way to set up the classification and regression memory locations for this larger feature map?

For example, the comments show a mapping of the memory:

# Class predictions   : (32x40 + 16x20 + 8x10 + 4x5) * 6 * 21 = 10200 * 21
#                       0x0000 - 0xD480 (1700 x 32: wide & multi-pass)
# 0x0000 - 0xA000: 32x40x121 (&wide)
# 0xA000 - 0xC800: 16x20x121 (&wide)
# 0xC800 - 0xD200: 8x10x121 (&wide)
# 0xD200 - 0xD480: 4x5x121 (&wide)
#
# Location predictions: (32x40 + 16x20 + 8x10 + 4x5) * 6 * 4 = 10200 * 4
#                       0xD500 - 0xEF90 (1700 x 4)
#
# 0xD500 - 0xE900: 32x40x24
# 0xE900 - 0xEE00: 16x20x24
# 0xEE00 - 0xEF40: 8x10x24
# 0xEF40 - 0xEF90: 4x5x24

I am reworking this for my own scenario: dropping the 4x5 output but adding the 64x80 output, with only 2 classes and the same filter shapes, like this:

# Class predictions   : (64x80 + 32x40 + 16x20 + 8x10) * 6 * 2 = 40800 * 2
#                       0x0000 - 0xD480 (6800 x 16: wide & NOT multi-pass)
# 0x0000 - 0x2800: 64x80x12 (&wide)
# 0x2800 - 0x4800: 32x40x12 (&wide)
# 0x4800 - 0x6800: 16x20x12 (&wide)
# 0x6800 - 0xD200: 8x10x12 (&wide)

I set the out_offset of the largest classification output to 0x0000, then the out_offset of the 32x40 output to 0x2800, and so on, roughly like this:
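In YAML terms, the class-prediction layer entries I have in mind look roughly like this (layer names are placeholders for my network; most fields omitted):

  # All four class outputs land in the same quadrant's data memory:
  - name: class_64_80
    out_offset: 0x0000     # 64x80x12 (&wide)
  - name: class_32_40
    out_offset: 0x2800     # 32x40x12 (&wide)
  - name: class_16_20
    out_offset: 0x4800     # 16x20x12 (&wide)
  - name: class_8_10
    out_offset: 0x6800     # 8x10x12 (&wide)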

While this setup synthesizes, would it cause memory overwrites/corruption of the 4 outputs since the memory locations overlap?

seldauyanik-maxim commented 11 months ago

Dear Nikky,

Yes, with the above setup the memory locations would overlap and cause corruption: both the location and the classification outputs must stay intact for the later NMS processing.

A 64x80 wide output fills up a whole data memory instance (data memory instance size: 81920 bytes = 0x14000): 64x80x4x4 => 0x00000 - 0x14000, meaning you will overwrite data if you use the same quadrant for any other layer's processing.

Since you have only 2 classes, you may consider moving the 64x80 resolution layer output to another data memory instance that can be used in parallel (for the 64x80 res: output_processors: 0x000000000fff0000; for the others: output_processors: 0x0000000000000fff).
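For illustration, a minimal sketch of the relevant fields (layer names are placeholders; all other fields omitted):

  # The 64x80 class output goes to processors 16-27 (quadrant 1), so its
  # data lands in quadrant 1's data memory; the other resolutions stay on
  # processors 0-11 (quadrant 0) and can use offset 0x0000 in parallel:
  - name: class_64_80
    out_offset: 0x0000
    output_processors: 0x000000000fff0000
  - name: class_32_40
    out_offset: 0x0000
    output_processors: 0x0000000000000fff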

For a proper modification, you need something like the following mapping:

Class predictions:

0x0000 - 0x14000: 64x80x12 (&wide) Quadrant 1
0x0000 - 0x5000: 32x40x12 (&wide) Quadrant 0
0x5000 - 0x6400: 16x20x12 (&wide) Quadrant 0
0x6400 - 0x6900: 8x10x12 (&wide) Quadrant 0

Location predictions:

0x6900 - 0xB900: 64x80x24 Quadrant 0
0xB900 - 0xEB00: 32x40x24 Quadrant 0
0xEB00 - 0xF000: 16x20x24 Quadrant 0
0xF000 - 0xF140: 8x10x24 Quadrant 0
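In the YAML, this would translate into out_offset values roughly as follows (a sketch with placeholder layer names; processors, write_gap, and the other fields are omitted):

  # Class predictions (wide):
  - name: class_64_80
    out_offset: 0x0000     # 0x0000 - 0x14000, quadrant 1 memory
  - name: class_32_40
    out_offset: 0x0000     # 0x0000 - 0x5000, quadrant 0 memory
  - name: class_16_20
    out_offset: 0x5000     # 0x5000 - 0x6400, quadrant 0 memory
  - name: class_8_10
    out_offset: 0x6400     # 0x6400 - 0x6900, quadrant 0 memory
  # Location predictions:
  - name: loc_64_80
    out_offset: 0x6900     # 0x6900 - 0xB900, quadrant 0 memory
  - name: loc_32_40
    out_offset: 0xB900     # 0xB900 - 0xEB00, quadrant 0 memory
  - name: loc_16_20
    out_offset: 0xEB00     # 0xEB00 - 0xF000, quadrant 0 memory
  - name: loc_8_10
    out_offset: 0xF000     # 0xF000 - 0xF140, quadrant 0 memory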


However, the above changes are not sufficient on their own: for this kind of change, the FPN detector layers' feature outputs (and the enc and skip layers they depend on) also have to be analyzed, as they must not be overwritten before the classification and regression layers consume them. (Please also go through layers such as 16, 20, 24, 28, 31, 32, 34, 37, 40, and 43, especially for managing the high-resolution intermediate outputs.)

After the YAML and the memory mapping are verified, the NMS code should also be modified to read the class and location outputs from the correct memory locations. The function that reads the location outputs only needs a memory address update, but you should go over the class prediction handling, as it assumes that all quadrants are active and that the classification outputs reside in the same quadrant.
nikky4D commented 11 months ago

Thank you very much for this response. It is very helpful.

Could you expand a bit more on the parallel usage you mention here:

As you are having 2 classes, you may consider moving 64x80 resolution layer output to another data memory instance that can be used in parallel (for 64x80 res: output_processors: 0x000000000fff0000, for others: output_processors: 0x0000000000000fff)

For example, if I move the FPN output for the 64x80 resolution to another set of output processors, how would I specify its location to the classification/regression layer that needs it, i.e. which processors are set? E.g.:

  - in_sequences: FPN_out_64_80
    in_offset: 0xF0A0
    out_offset: 0x10AE0
    processors: 0xffffffffffffffff  ##<-- Does this change?
    output_processors: 0xffffffffffffffff
    operation: conv2d
    kernel_size: 3x3
    pad: 1
    activate: ReLU
    write_gap: 1
    name: loc_64_80_res0_preprocess
    weight_source: loc_64_80_res0_preprocess
github-actions[bot] commented 10 months ago

This issue has been marked stale because it has been open for over 30 days with no activity. It will be closed automatically in 10 days unless a comment is added or the "Stale" label is removed.