analogdevicesinc / ai8x-synthesis

Quantization and Synthesis (Device Specific Code Generation) for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

KAT error #353

Open eshankn opened 2 months ago

eshankn commented 2 months ago

Hello, I have been able to generate the C code for my application using the synthesizer tool, but while testing the code on the hardware, the known-answer test (KAT) fails with a Data mismatch error.

The sample input provided is 1D-shaped data of size 1 x 768, and the model is as follows:

self.block1_conv1d_bn_relu_1 = ai8x.FusedConv1dBNReLU(in_channels=1, out_channels=1, kernel_size=2, stride=1, padding=0)
self.block1_conv1d_bn_relu_2 = ai8x.FusedConv1dBNReLU(in_channels=1, out_channels=1, kernel_size=2, stride=1, padding=0)
self.block1_conv1d_bn_relu_3 = ai8x.FusedConv1dBNReLU(in_channels=1, out_channels=1, kernel_size=2, stride=1, padding=0)
self.block1_conv1d_bn_relu_4 = ai8x.FusedConv1dBNReLU(in_channels=1, out_channels=1, kernel_size=2, stride=1, padding=0)
self.block1_maxpool1d_1 = ai8x.MaxPool1d(kernel_size=3, stride=3)

self.block6_flatten = nn.Flatten()
self.block6_dense = ai8x.Linear(in_features=254, out_features=2)
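For reference, a quick length calculation (a sketch based only on the layer parameters shown above) confirms how in_features=254 of the Linear layer follows from the 1 x 768 input:

length = 768                    # 1 x 768 sample input
for _ in range(4):              # four Conv1d layers: kernel_size=2, stride=1, padding=0
    length = length - 2 + 1     # each convolution shrinks the length by 1
length = (length - 3) // 3 + 1  # MaxPool1d(kernel_size=3, stride=3): 764 -> 254
assert length * 1 == 254        # one channel flattened -> in_features=254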

I am unable to interpret the error, but I believe the KAT fails because of the processor mapping in the YAML file. After several rounds of trial and error with the processor configurations, I arrived at a configuration that passes the KAT. Although I have a fair understanding of how to create the YAML file from the provided documentation, I cannot explain the following observations when comparing my configuration with energy_profiling_kat_pass.yaml.

  1. Using quadrant 0 for the layer 3 processors (as in the previous layers) fails the KAT.
  2. Using quadrant 1 for the layer 4 and 5 processors (as in layer 3) also fails the KAT.
  3. Not specifying output_processors for the last layer (layer 5) fails the KAT.
  4. Using quadrant 2 for the layer 5 output_processors fails the KAT.
  5. Using output_processors: 0x0006.0000.0000.0000, output_processors: 0x0009.0000.0000.0000, or output_processors: 0x0011.0000.0000.0000 for layer 5 also fails the KAT.

ep_demo.zip contains the necessary files to replicate the above specific scenarios.
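For context, the layer entries in question look roughly like the following. This is an illustrative fragment only, not the actual energy_profiling_kat_pass.yaml; the field names follow the documented YAML format, but the processor masks and offsets shown here are placeholders.

layers:
  # Layer 0: first Conv1d, single input channel mapped to one processor in quadrant 0
  - operation: Conv1d
    kernel_size: 2
    pad: 0
    activate: ReLU
    processors: 0x0000.0000.0000.0001
    out_offset: 0x2000
  # ... intermediate layers omitted ...
  # Layer 5: final Linear layer, with output_processors pinned explicitly
  - operation: Linear
    flatten: true
    processors: 0x0000.0001.0000.0000
    output_processors: 0x0001.0000.0000.0000
    out_offset: 0x0000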

The C code was generated using python ai8xize.py --verbose --test-dir demos --prefix ep_demo --checkpoint-file ep_demo_qat_best-q.pth.tar --device MAX78000 --softmax --compact-data --sample-input sample_ep_demo.npy --config-file energy_profiling_kat_pass.yaml --energy

I was also unable to use --stop-after to debug the problematic layer.

EDIT: Using out_channels=2 or out_channels=4 after the first layer during training, and then regenerating the code, passes the KAT for the first four scenarios above. The fifth scenario still fails the KAT.

alicangok commented 2 months ago

Hi,

First of all, thanks for your detailed explanations and for providing the necessary files.

We have been able to reproduce the errors in the scenarios you provided on our end, though we do not have a definite solution at the moment.

Our initial theory for the cause of the problem was the use of the Flatten layer right after MaxPool1D without a convolution operation, since flattening cannot be used in the same layer as pooling, as described here. However, your network has a standalone MaxPool1D layer, which is allowed even without a fused convolution operation. Nevertheless, as a sanity check, we explicitly added a "fake passthrough" layer after the max pooling, and it indeed did not solve the problem. (Note: the add_fake_passthrough.py utility adds an identity layer after the specified layer of a model checkpoint, to circumvent certain limitations. See example usage here.)

We will continue looking into this problem. In the meantime, I hope you are not stuck in your work and can deploy your network on the hardware thanks to your workarounds.

eshankn commented 2 months ago

@alicangok thank you for providing the update.

At the moment, I am able to deploy my end-application network on the hardware. Given the hardware's capabilities, I have become curious about the energy consumption of variations of my network, and I have a few related questions.

  1. Is there a way to measure the energy consumed by each layer of the deployed network on the hardware?
  2. Is there a constraint on the input data dimensions or the number of network operations when measuring the energy consumed? I want to reduce the total energy consumption of my application, so I have been experimenting with smaller networks and input dimensions. To give more context, I created the simple network below and provided 1D input data of varying dimensions (adjusting the other network parameters accordingly to satisfy the limitations).
self.block1_conv1d_bn_relu_1 = ai8x.FusedConv1dBNReLU(in_channels=1, out_channels=4, kernel_size=2, stride=1, padding=0)
self.block6_flatten = nn.Flatten()
self.block6_dense = ai8x.Linear(in_features=124, out_features=2)
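As a quick sanity check of those dimensions (a sketch assuming the 1 x 32 input case mentioned below):

length = 32              # assumed 1 x 32 sample input
length = length - 2 + 1  # Conv1d: kernel_size=2, stride=1, padding=0 -> 31
features = length * 4    # out_channels=4, then Flatten
assert features == 124   # matches in_features=124 of the Linear layer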

I can measure energy through the PMON and an external device for input data dimensions of 1 x 64 and larger. However, that is not the case for dimensions of 1 x 32 and smaller. The serial output gets stuck at Measuring input load + inference... because the CNN_COMPLETE trigger is never generated. I understand that the PMON measurement repeats each operation 100 times to accumulate enough energy, and my initial guess was that the number of network operations was too low. Therefore, I gradually increased the complexity of the network by adding multiple convolution and pooling layers in between, but that did not change the result either.

alicangok commented 2 months ago

Hello again @eshankn,

Regarding your first question, while there is no direct method to measure the energy consumed by each layer, you may use the --stop-after argument with consecutive layers and subtract the consumed energies to get an estimate.
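For example (an illustrative sketch only; the layer indices and prefixes are arbitrary, and the remaining arguments would match the generation command used earlier in this thread):

python ai8xize.py ... --energy --stop-after 2 --prefix ep_demo_stop2
python ai8xize.py ... --energy --stop-after 3 --prefix ep_demo_stop3

Subtracting the energy measured for the first project from that of the second gives an estimate of the energy consumed by layer 3 alone.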

As for your second question, I am not aware of such an explicit constraint. However, with very small and fast networks, the hardware may hang if the inference finishes before the main code has had a chance to enter sleep mode. I would suggest trying the --no-wfi argument to disable sleep mode for your testing. More details are provided here.

P.S. We will let you know once we have better understanding of your earlier issue. We have been continuing our investigation in different scenarios and both MAX78000 and MAX78002 hardware.

eshankn commented 2 months ago

@alicangok thank you for your response. Using --no-wfi does the trick and the inference is completed for the case mentioned!

Regarding the KAT error, I have another scenario. According to Limitations of MAX78000 Networks, the maximum dimension (number of rows or columns) for input or output data is 1023. In theory, 1D-shaped input data of size 1 x 1019 with padding of 2 is acceptable, and the synthesis tool does generate the C code, but the KAT fails with the same Data mismatch error. Input sizes of 1 x 1018 and 1 x 1017 return the same error, while an input of size 1 x 1016 passes the KAT, which I am unable to explain.

Additionally, an input of size 1 x 1021 throws an error from the synthesis tool, as expected, because it exceeds the dimension limit. However, the synthesis tool can generate the C code when an input of size 1 x 1020 is provided. Although it gives the same KAT error, this conflicts with the stated constraint, considering that the effective dimension is 1024 (input dimension of 1020 plus padding of 2 on either side). I assume the line decrements the input dimension, thus allowing a maximum dimension of 1024. Please correct me if my understanding is wrong.

ep_demo_input_size.zip contains the necessary files to replicate the above scenario.

github-actions[bot] commented 1 month ago

This issue has been marked stale because it has been open for over 30 days with no activity. It will be closed automatically in 10 days unless a comment is added or the "Stale" label is removed.

ermanok commented 3 weeks ago

Thanks for reporting this issue. In this case, the network fails not because of the convolution in the initial layer but because of the maxpool operation in the second layer. For the maxpool, the input length plus the kernel length must be smaller than 1026; for network input lengths above 1016, the input length of the second layer exceeds 1017, which is the maximum input length for a pooling layer with a kernel length of 8. We will update the documentation accordingly and/or add proper assertions to the synthesis code.
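The arithmetic behind this limit can be checked with a small helper (a sketch that only encodes the constraint stated above; the conv1d output-length formula is the standard one):

def conv1d_out_len(in_len, kernel, stride=1, pad=0):
    # Standard 1D convolution output-length formula
    return (in_len + 2 * pad - kernel) // stride + 1

def pool_input_ok(pool_in_len, pool_kernel):
    # Pooling constraint described above: input length + kernel length < 1026
    return pool_in_len + pool_kernel < 1026

# Feed the output length of the preceding convolution into pool_input_ok()
# to check whether a given network input length satisfies the limit.
# For an 8-wide pooling kernel, the maximum allowed pooling input length is 1017:
assert pool_input_ok(1017, 8)
assert not pool_input_ok(1018, 8)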