analogdevicesinc / ai8x-synthesis

Quantization and Synthesis (Device Specific Code Generation) for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

Block diagram in userguide page 381 #256

Closed hyunjongL closed 1 year ago

hyunjongL commented 1 year ago

There is a dark blue aggregator and three light blue aggregators.

  1. Is the dark blue aggregator used when the quadrant is the master quadrant?
  2. If a quadrant is not the master quadrant would the dark blue aggregator work as if it were a light blue aggregator?
  3. I want to know more about how the output of each processor is passed on to other parts. Do the outputs first go to the light blue aggregator, which computes a partial sum of products? Are the results all concatenated and sent to the master quadrant? Why is there an input from a light blue aggregator to the group's shared memory?
  4. (For a single pass, where the number of input channels is less than 64) Does the data write work in parallel with the processors, or does it wait for partial/entire results (and do the processors also wait for data writes to finish)?
  5. For multiple passes, how is it different? Does the multi-pass accumulator use data SRAM, or is it a cache within the dark blue aggregator?

Many thanks!

rotx-maxim commented 1 year ago
  1. Yes
  2. Yes
  3. The non-master quadrants' results are sent to the master quadrant. The reason for a path to local memory: Depending on the operation, there is no aggregation of results (for example, in "passthrough mode").
  4. There are cases where the output can work in parallel (for example, passthrough operations that write to local memory only), but for most operations, including convolutions, the output is serialized.
  5. The multi-pass data are stored in non-user accessible internal registers.
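The aggregation path described in these answers can be sketched conceptually. This is purely illustrative; the function names, structure, and the two-level sum are assumptions drawn from the block diagram discussion, not actual hardware behavior or code from the repository:

```python
# Conceptual sketch only (assumed structure, not real hardware code):
# each quadrant's light blue aggregator forms a partial sum of its
# processors' products, and the master quadrant's dark blue aggregator
# combines the partial sums forwarded from all quadrants.

def quadrant_partial_sum(processor_outputs):
    """Light blue aggregator: partial sum of products within one quadrant."""
    return sum(processor_outputs)

def master_aggregate(quadrant_sums):
    """Dark blue aggregator (master quadrant): combine partial sums
    forwarded from all quadrants into the final accumulated value."""
    return sum(quadrant_sums)

# Example: four quadrants contributing to one output value.
quadrants = [[1, 2], [3], [4, 5], [6]]
partials = [quadrant_partial_sum(q) for q in quadrants]
result = master_aggregate(partials)
```

In "passthrough mode" the aggregation step would be skipped and each quadrant would write its outputs to local memory directly, which matches the path to the group's shared memory noted above.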
hyunjongL commented 1 year ago

Thanks, this is the part of this board I am having the most difficulty understanding. I have been measuring latency for different numbers of channels. One thing I noticed is that each channel introduces an additional computation overhead (20 µs, though I am not sure whether this is model/data dependent), not just the time for the data write. Is there a control or synchronization cost for each processor that is incurred sequentially? For example, does loading the next weights, or configuring the processors (layer info, TRAM reset, ...), happen in sequence?

rotx-maxim commented 1 year ago

Configuring, and loading data and weights, happens sequentially on the selected clock. For the inference time, I'm going to send you a (simplified) piece of code that calculates the number of cycles for simple cases (i.e., no streaming, no element-wise operations). We'll eventually add this to the toolset, but it's not quite ready for that yet. Let me know where you want me to email the code.
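A first-order estimator of this kind might look like the sketch below. Everything here is an assumption for illustration: the function name, the per-pass overhead constant, and the formula are placeholders, not the actual code being offered in this thread or numbers from the MAX78000 documentation:

```python
def estimate_conv_cycles(rows, cols, out_ch, in_ch,
                         procs_per_pass=64, pass_overhead=100):
    """Rough, illustrative cycle estimate for a single non-streaming
    convolution layer. All constants are placeholders.

    rows, cols      -- output feature-map dimensions
    out_ch, in_ch   -- output / input channel counts
    procs_per_pass  -- channels handled per pass (64 processors assumed)
    pass_overhead   -- assumed fixed reconfiguration cost per pass
    """
    # Input channels beyond the processor count require extra passes.
    passes = -(-in_ch // procs_per_pass)  # ceiling division
    # One serialized output window per output pixel and channel.
    mac_windows = rows * cols * out_ch
    # Each pass repeats the windows and adds its fixed overhead,
    # consistent with a per-channel overhead growing with pass count.
    return passes * (mac_windows + pass_overhead)
```

A model like this would reproduce the observed behavior qualitatively: latency scales with output size per pass, and crossing a 64-input-channel boundary adds a whole extra pass rather than a smooth increment.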

hyunjongL commented 1 year ago

Please send it to hyunjongl@kaist.ac.kr . Thanks!