analogdevicesinc / ai8x-synthesis

Quantization and Synthesis (Device Specific Code Generation) for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

Out_offset (YAML) and SRAM write pointer #255

Closed hyunjongL closed 1 year ago

hyunjongL commented 1 year ago

https://github.com/MaximIntegratedAI/ai8x-synthesis/blob/develop/izer/backend/max7800x.py#L1673

instance = ffs(output_processor_map[ll] >> group * tc.dev.P_SHARED) \
  & ~(tc.dev.P_SHARED-1)
val |= (instance + group * tc.dev.P_SHARED) * tc.dev.INSTANCE_SIZE

According to this code, the actual value written to the register is (out_offset (from the YAML) + 0x8000 × smallest_out_processor_group_index) / 4.

  1. Are out_offset/in_offset specified in bytes, while the write pointer is an address where each address points to a 32-bit word? I think this is why I am getting confused.
  2. When writing data, where and how does the address translation happen? This address scheme does not match the one in the user guide, where the SRAMs of the different quadrants are separate.

(Below is a copy of documentation from the same file.)

Configure SRAM write pointer -- write ptr is global (unless depth-wise w/o broadcast is used). Get offset to first available instance of the first used processor of the next layer.

rotx-maxim commented 1 year ago
  1. Yes, that's exactly correct. We specify all user-facing addresses in bytes.
  2. When writing, the global start address of the targeted output memory is added to the offset. For example, offset 0x2000 (in bytes) for processor 4 becomes register value 0x00002800 (= 0x2000/4 + (0x50408000-0x50400000)/4), where 0x2000 is the specified target offset in bytes and 0x50408000 is the byte address of the second data memory instance (the one used by processor 4).
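That arithmetic can be checked with a short sketch. This is a hypothetical helper, not code from the repo; the 0x50400000 base address, the 0x8000-byte instance stride, and the four-processors-per-instance grouping are all taken from the example above:

```python
DATA_MEM_BASE = 0x50400000   # byte address of the first data memory instance (from the example)
INSTANCE_BYTES = 0x8000      # byte stride between data memory instances (from the example)
P_SHARED = 4                 # processors that share one data memory instance

def write_ptr_register(out_offset_bytes, processor):
    """Register value for a given byte offset and first output processor."""
    instance = processor // P_SHARED                          # which data memory instance
    byte_addr = DATA_MEM_BASE + instance * INSTANCE_BYTES + out_offset_bytes
    return (byte_addr - DATA_MEM_BASE) // 4                   # registers hold 32-bit word addresses

print(hex(write_ptr_register(0x2000, 4)))  # 0x2800, matching the example above
```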
hyunjongL commented 1 year ago

So the memory addressing the master quadrant uses is different from the memory configuration on page 382 of the user guide, in that it not only points to 32-bit words but also makes all the write addresses contiguous? In the figure, the SRAM addresses are separated from each other. @rotx-maxim

Thanks!

rotx-maxim commented 1 year ago

The user guide map is an abbreviation of the memory map. Addresses in the map are represented as 32-bit, but since the memories are not byte addressable, the bottom bits are not used. When programming the accelerator, the bottom bits are not programmed and neither is the global address offset (i.e., the accelerator gets native word addresses). The memory space is contiguous in this case, but it's not guaranteed for future devices. There may be "holes" in the address space. However, the accelerator always knows where to write the "next" word. Check docs/AHBAddresses.md for another view of the addresses.
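The byte-address vs. native-word-address relationship described above can be sketched as a pair of hypothetical conversion helpers (the 0x50400000 base is taken from the earlier example; on this device the bottom two bits of a byte address are simply dropped):

```python
GLOBAL_BASE = 0x50400000  # byte base of the data memory, from the example earlier in the thread

def cpu_to_accel(byte_addr):
    # Strip the global address offset and the bottom (byte-within-word) bits:
    # the accelerator works with native 32-bit word addresses.
    return (byte_addr - GLOBAL_BASE) >> 2

def accel_to_cpu(word_addr):
    # Inverse mapping back to a CPU-visible byte address.
    return GLOBAL_BASE + (word_addr << 2)

print(hex(cpu_to_accel(0x50408000)))  # 0x2000: the second data memory instance
```

Note this assumes the contiguous layout of the current device; as stated above, future devices may have holes in the address space.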

hyunjongL commented 1 year ago

I figured out that the write pointer does not jump if there are any unused processors within the allocation (0x800...001 takes a lot more time than 0x000...00F).

  1. Does it write 8 bits each cycle?
  2. Does the writer skip writing and move on each time, or does it overwrite the memory in between the processors with zeros?
  3. Is there any way to make the pointer jump?
rotx-maxim commented 1 year ago
  1. No, the memory is a 32-bit memory, and four 8-bit values are combined into a single 32-bit write
  2. When you look at the generated code, the "skipped" processors are actually enabled with weights of zero. This means that the output is written, with zeros.
  3. No, there is not
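Since every write is a full 32-bit word covering a group of four processors, a small sketch (hypothetical helper, based on the answers above) can show which byte lanes of each written word carry real outputs; the remaining lanes in an active group are still written, but with zeros, because the "skipped" processors are enabled with zero weights:

```python
def lane_usage(processor_map):
    """For each group of 4 processors sharing a data memory instance,
    list the byte lanes that carry real outputs. Groups with no bits set
    are not written at all; within an active group, unlisted lanes are
    written as zeros (skipped processors get zero weights)."""
    usage = {}
    group = 0
    while processor_map:
        lanes = processor_map & 0xF          # 4 processors per group
        if lanes:
            usage[group] = [p for p in range(4) if lanes & (1 << p)]
        processor_map >>= 4
        group += 1
    return usage

print(lane_usage(0x0F0F))  # {0: [0, 1, 2, 3], 2: [0, 1, 2, 3]}
print(lane_usage(0x0011))  # {0: [0], 1: [0]} -- lanes 1-3 are zero-filled
```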
hyunjongL commented 1 year ago

  0. Does the writer write 32 bits for the four channels even if only one processor will be used?
  0.2. Does the writer overwrite the other 24 bits with zeros if the above happens?
  1. Let's say that for the first quadrant I assign 8 processors, 0x0F0F. Then the group in between will have zeros in its memory, and the memory that is not in between (the first zero) will not be overwritten?
  2. Okay, I guess we have to make sure to always use consecutive processors.

I closed the other issue, so will just ask it here.

  1. Why can't the configuration of the processors be parallelized? Is there a memory access or operation that needs to run in sequence?
  2. The configuration per processor took longer than I expected (20 µs per processor when running a 3×3 conv2d) (my mistake! it takes around ~2 µs). What is the main bottleneck there, and does it depend on what operation the processor will run? (E.g., a 3×3 conv2d loads 9 weights while a 1×1 conv2d needs to load only 1, but that doesn't seem to make much difference.)
rotx-maxim commented 1 year ago
  1. Yes, the memory is always accessed as 32-bit. Therefore, since there is a correlation between channel count and processor count, we recommend multiples of 4, 16, or 64 to use the hardware to full advantage. 0.2: Zero (when the weights are zero), or whatever the result of the other processors is.
  2. This is correct, however, keep in mind that (assuming 8-bit outputs) the machine always writes 32-bit (groups of 4 channels).
  3. The CNN accelerator is a bus peripheral, and you cannot have more than one access to it at any one time.
  4. You have to distinguish between loading weights and loading the configuration data. The weights obviously depend on the operation (for Conv2d, it is the number of input channels times the number of output channels, the kernel size, and the quantization). The layer configuration is just a few short writes, and we skip the ones that are zero, so the "simpler" an operation, the fewer writes. Both the weights and the layer setup will stay configured, and don't have to be reloaded for each inference.
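The weight-memory footprint described in point 4 can be written as a quick back-of-the-envelope helper (a hypothetical sketch, assuming weights are packed at the stated quantization with no per-device padding):

```python
def conv2d_weight_bytes(in_ch, out_ch, kernel=3, bits=8):
    """Approximate Conv2d weight storage: in_ch * out_ch kernels of
    kernel*kernel elements, each `bits` wide (quantization)."""
    return in_ch * out_ch * kernel * kernel * bits // 8

# A 3x3 Conv2d loads 9x the weights of a 1x1 at the same channel counts:
print(conv2d_weight_bytes(4, 4, kernel=3))  # 144 bytes
print(conv2d_weight_bytes(4, 4, kernel=1))  # 16 bytes
```

As noted above, both weights and layer setup persist across inferences, so this cost is paid once at load time, not per inference.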
hyunjongL commented 1 year ago

One thing I find strange is that it takes different amounts of time to write to these output processors: 0x0011, 0x0021, 0x0041, 0x0081. If the memory writes are performed 32 bits each, I think they should take the same time; however, 0x0011 takes the least time and 0x0081 takes the most.

On the other hand, 0x0011, 0x0012, 0x0014, and 0x0018 take the same time, which does suggest the memory writes are performed 32 bits at a time.

Are there any exceptions for the group with the highest index?

And about loading weights: I meant loading a kernel from the weights memory to a processor during inference.

rotx-maxim commented 1 year ago

Check the processor and mask enable bits in the generated code. It's hard to give a definitive answer without seeing the code (please email if needed).