analogdevicesinc / ai8x-training

Model Training for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

How to Implement only ReLU without using convolutions (1d) #305

Closed LochanaMendis closed 3 months ago

LochanaMendis commented 5 months ago

I wanted to implement only the ReLU activation, without using operations like convolution. For example, let's say I want to add two conv1d outputs and then apply ReLU. This raises the error "Activations cannot be used with passthrough layers". As an alternative, can I use an identity 1D conv with ReLU?

rotx-eva commented 5 months ago

Yes, activations can be used when convolutions are used. There are some corner cases of element-wise + convolution that are unsupported, but the tool should tell you whether it's expected to work.
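
For illustration, a minimal sketch of that advice, assuming the `ai8x.Add` and `ai8x.FusedConv1dReLU` operators from ai8x.py: keep the element-wise add activation-free and attach the ReLU to a convolution that consumes its output.

```python
import torch.nn as nn
import ai8x


class AddThenConvReLU(nn.Module):
    """Two conv1d branches, an activation-free element-wise add, then a fused conv + ReLU."""
    def __init__(self, channels, bias=False, **kwargs):
        super().__init__()
        self.branch_a = ai8x.Conv1d(channels, channels, 3, padding=1, bias=bias, **kwargs)
        self.branch_b = ai8x.Conv1d(channels, channels, 3, padding=1, bias=bias, **kwargs)
        self.add = ai8x.Add()  # passthrough layer: no activation attached here
        self.conv_relu = ai8x.FusedConv1dReLU(channels, channels, 1, bias=bias, **kwargs)

    def forward(self, x):
        y = self.add(self.branch_a(x), self.branch_b(x))
        return self.conv_relu(y)
```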

LochanaMendis commented 5 months ago

@rotx-maxim So, does that mean I can't use the ReLU activations without the convolutions?

Also, is the implementation of the ReLU-only class below correct?

```python
class FusedReLUonly(Conv1d):
    """ReLU Activation Function"""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, kernel_size=1, activation='ReLU', **kwargs)

        # Initialize weights to 1
        self.op.weight.data.fill_(1.0)

        # Make weights non-trainable
        for param in self.op.parameters():
            param.requires_grad = False
```

The idea is that the identity convolution has kernel size 1 and passes the input to the output without any change, and then the ReLU activation is applied. The convolution weights should be equal to 1 and should not be updated during training.

LochanaMendis commented 5 months ago

Hi @rotx-maxim, I think I got it to work. Thanks!

```python
class FusedReLUonly(Conv1d):
    def __init__(self, in_channels, *args, **kwargs):
        super().__init__(in_channels, *args, activation='ReLU', **kwargs)

        # Initialize weights to 0
        nn.init.zeros_(self.op.weight)
        # Set the identity kernel
        self.op.weight.data[:, :, 0] = torch.eye(in_channels, in_channels)
        # Make weights non-trainable
        for param in self.op.parameters():
            param.requires_grad = False
        print(self.op.weight.requires_grad)
```

ermanok commented 5 months ago

Hello,

The approach you proposed works well for model training until Quantization Aware Training starts. Because the layer weights are quantized in that mode, the 1s in your model will be replaced by 0.9921875 (= 127/128): the 8-bit hardware values {-128, -127, -126, …, 126, 127} correspond to the floating-point numbers {-128.0/128.0, -127.0/128.0, -126.0/128.0, …, 126.0/128.0, 127.0/128.0}, so 1.0 itself is not representable. You may therefore observe slightly different results than with a true ReLU activation layer.
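
A small numeric illustration of that mapping (plain Python; this mimics the round-and-clamp to 8 bits described above, not the actual QAT code):

```python
def quantize_w8(w):
    """Map a weight to the nearest representable 8-bit value (scale 1/128), with clamping."""
    q = max(-128, min(127, round(w * 128)))
    return q / 128

print(quantize_w8(1.0))   # 0.9921875 (= 127/128): 1.0 itself is not representable
print(quantize_w8(0.5))   # 0.5       (= 64/128)
```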

For your case, it is better to use a regular nn.ReLU layer when you need a ReLU activation without a conv layer. You also need to add clamping, as all activations >= 1 should be clamped. I think the easiest way to get this to work correctly is to modify the ai8x.FusedSoftwareLinearReLU class by removing self.op. You then end up with a layer like this:

```python
class ReLU(nn.Module):
    def __init__(self):
        super().__init__()

        if dev.simulate:
            self.quantize = Quantize(num_bits=dev.DATA_BITS)
            bits = dev.FC_ACTIVATION_BITS
            self.clamp = Clamp(min_val=-(2**(bits-1)), max_val=2**(bits-1)-1)
        else:
            self.quantize = Empty()
            self.clamp = Clamp(min_val=-1., max_val=127./128.)

        self.activate = nn.ReLU(inplace=True)

    def forward(self, x):  # pylint: disable=arguments-differ
        """Forward prop"""
        x = self.clamp(self.quantize(self.activate(x)))
        return x
```
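
A hypothetical usage sketch (module and channel names are illustrative), assuming the class above lives next to the other ai8x operators, with the standalone ReLU applied to an element-wise sum:

```python
import torch.nn as nn
import ai8x


class ResidualBlock(nn.Module):
    """Residual block whose final ReLU is the standalone activation layer, not a fused conv."""
    def __init__(self, channels, bias=False, **kwargs):
        super().__init__()
        self.conv1 = ai8x.FusedConv1dReLU(channels, channels, 3, padding=1, bias=bias, **kwargs)
        self.conv2 = ai8x.Conv1d(channels, channels, 3, padding=1, bias=bias, **kwargs)
        self.add = ai8x.Add()
        self.relu = ReLU()  # the activation-only layer sketched above

    def forward(self, x):
        return self.relu(self.add(x, self.conv2(self.conv1(x))))
```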

For the model synthesis step, you need to add an extra layer to the model checkpoint. The 'add_fake_passthrough.py' utility does this, and there is an example usage here. The utility adds an 'identity layer' after the specified layer of a model checkpoint, so you end up with an exact ReLU layer in the quantized model.

Note that you have to add the fake passthrough layer to the YAML file of the model (the one used to synthesize the model for the hardware) and set ReLU as the activation of that layer ('activate: ReLU').

LochanaMendis commented 5 months ago

Hi @ermanok,

I appreciate your quick response. I understand the use of a fake passthrough to use ReLU. My setup: I am using the MAX78002, and the model I am training is a 1D model for time-series classification. When I train with the above ReLU class, my training AUC is good but my evaluation performance is poor. I think the main reason is the use of self.quantize during evaluation; when I remove it, the evaluation performance is similar to training. Is self.quantize important for the ReLU class?

ermanok commented 5 months ago

self.quantize is required if you want to evaluate your quantized model, which you can obtain using the quantize.py utility in the synthesis repo. The model checkpoint you obtain after model training still has floating-point weights, even after QAT has been initiated. Are you sure that you are evaluating the quantized model?
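
One way to double-check which checkpoint is being evaluated (a hypothetical snippet; the path is illustrative and the integer-valued-weights check is an assumption about what quantize.py produces):

```python
import torch

ckpt = torch.load('logs/qat_best-q.pth.tar', map_location='cpu')  # illustrative path
state = ckpt.get('state_dict', ckpt)

for name, w in state.items():
    if name.endswith('.op.weight'):
        w = w.float()
        # A quantized checkpoint typically holds integer-valued weights,
        # while a training/QAT checkpoint still holds arbitrary floating-point fractions.
        print(name, float(w.min()), float(w.max()), bool(torch.equal(w, w.round())))
```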

LochanaMendis commented 5 months ago

Thanks for the clarification. Yes, I am evaluating the quantized model. I am a bit concerned and am trying to understand why I get almost zero values as output during evaluation. As I understand it, quantization is the process of converting the floating-point weights from a checkpoint file into fixed-point weights. The ReLU function does not have weights, so can you help me understand why quantization is needed there?

ermanok commented 4 months ago

Just to be safe. In that mode, the quantization function rounds the numbers exactly as the hardware does, so the output is guaranteed to match the hardware even if you have a non-integer output before that layer.
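
As a rough illustration (plain Python, not the actual ai8x Quantize code), rounding an activation onto the 8-bit data grid is what keeps the evaluation in lock-step with the chip:

```python
def to_hw_grid(x, bits=8):
    """Round an activation onto the 1/2^(bits-1) fixed-point grid used for the data path."""
    scale = 2 ** (bits - 1)
    return round(x * scale) / scale

print(to_hw_grid(0.30001))                          # 0.296875 (= 38/128)
print(to_hw_grid(0.30001) == to_hw_grid(0.29999))   # True: both round to the same grid point
```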

LochanaMendis commented 4 months ago

Hi @ermanok, thanks for the help. I have a different question: is it possible that the evaluation performance metrics differ from the on-chip performance metrics? That is, can the model output from evaluation differ from the output obtained from inference on the chip for the same input?

ermanok commented 4 months ago

They should be exactly the same, but if you run the evaluation on a GPU, it sometimes produces slightly different outputs. These minor differences can cause rounding errors at some layers, and the rounding errors accumulate up to the last layer of the network. In some cases the accumulated error can change the decision at the output. When the evaluation is run on the CPU, however, the outputs of model evaluation and hardware inference should be exactly the same.
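
The underlying cause is ordinary floating-point non-associativity; a GPU may sum partial products in a different order than the CPU, and the tiny discrepancy can land on the other side of a rounding boundary:

```python
print((0.1 + 0.2) + 0.3)                        # 0.6000000000000001
print(0.1 + (0.2 + 0.3))                        # 0.6
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: same numbers, different order
```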

LochanaMendis commented 4 months ago

Thanks for the quick reply. Below is the evaluation output when I print the model output values (2-class classification). I have 32 output bits set in the last layer.

```
==> Sample 8 output softmax: [0.56183845 0.43816155] output: [-0.01205444 -0.26068115] target: 0
==> Sample 9 output softmax: [0.6063965  0.39360353] output: [ 0.0803833  -0.35180664] target: 0
==> Sample 10 output softmax: [0.5462692 0.4537308] output: [-0.01034546 -0.19595337] target: 0
==> Sample 11 output softmax: [0.666334   0.33366603] output: [ 0.14724731 -0.5444031 ] target: 0
==> Sample 12 output softmax: [0.53087497 0.469125  ] output: [ 0.03030396 -0.09335327] target: 0
==> Sample 13 output softmax: [0.38312188 0.61687815] output: [-0.19671631  0.27960205] target: 1
```

I changed the sample.py code to print the above values and save all the test samples:

```python
import numpy as np

def softmax(x):
    """
    Compute softmax values for each set of scores in x.
    Implemented by Lochana.

    Arguments:
    x -- A numpy array of shape (n, num_classes) containing the raw scores

    Returns:
    A numpy array of the same shape as x containing the softmax values
    """
    exp_scores = np.exp(x - np.max(x, axis=1, keepdims=True))  # subtract the max for numerical stability
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def generate(
        index,
        inputs,
        targets,  # pylint: disable=unused-argument
        outputs,  # pylint: disable=unused-argument
        dataset_name,
        search=False,  # pylint: disable=unused-argument
):
    """
    Save the element `index` from the `inputs` batch to a file named "sample_`dataset_name`.npy".
    If `search`, then check `outputs` against `targets` until a match is found.
    """
    if index >= len(inputs):
        raise ValueError('--generate-sample index is larger than the data batch size')

    sample_name = 'sample_' + dataset_name.lower()

    # FIXME: Implement search
    found = False
    if search:
        # Softmax over the whole batch; save the first true positive of class 1
        output_softmax = softmax(outputs.cpu().numpy())
        for i in range(len(inputs)):
            print(f'==> Sample {i} output softmax: {output_softmax[i]} '
                  f'output: {outputs.cpu().numpy()[i]} target: {targets[i]}')
            if found:
                continue
            if (targets[i] == 1) and (output_softmax[i][1] >= 0.5):
                index = i
                print(f'==> Saving sample at index {index} to {sample_name}.npy')
                found = True

    print(f'==> Saving sample at index {index} to {sample_name}.npy')
    x = inputs[index].cpu().numpy().astype('int64')
    x = np.clip(x, -128, 127)
    np.save(sample_name, x, allow_pickle=False, fix_imports=False)

    # print('==> Savings all test samples to testdata_samples folder')
    # for i in range(len(inputs)):
    #     x = inputs[i].cpu().numpy().astype('int64')
    #     x = np.clip(x, -128, 127)
    #     np.save(f'testdata_samples/all_/{sample_name}_'+str(i)+f'_target_{targets[i]}', x, allow_pickle=False, fix_imports=False)

    return False
```

So when I give the saved sample test files to the chip, should the output from the chip be the same as the output values above (before softmax), but in Q17.14 format?

ermanok commented 4 months ago

Yes, that is the expected output on the hardware.
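
For reference, a small conversion sketch, assuming the 32-bit wide output is read back as a signed Q17.14 integer (14 fractional bits), as in the question:

```python
def float_to_q17_14(x):
    """Nearest Q17.14 fixed-point integer for a floating-point evaluation output."""
    return round(x * (1 << 14))

def q17_14_to_float(v):
    """Floating-point value of a signed Q17.14 integer read from the chip."""
    return v / (1 << 14)

print(float_to_q17_14(0.28125))   # 4608
print(q17_14_to_float(4608))      # 0.28125
```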

github-actions[bot] commented 3 months ago

This issue has been marked stale because it has been open for over 30 days with no activity. It will be closed automatically in 10 days unless a comment is added or the "Stale" label is removed.