alibaba / TinyNeuralNetwork

TinyNeuralNetwork is an efficient and easy-to-use deep learning model compression framework.
MIT License

LOG_SOFTMAX not being quantized #119

Closed · travisjayday closed this issue 2 years ago

travisjayday commented 2 years ago

Hi there, LOG_SOFTMAX isn't being quantized to INT8. The converter adds a dequantize layer before the LOG_SOFTMAX node. This is not the behavior of the regular TensorFlow converter, which quantizes LOG_SOFTMAX so that the model's output is INT8. Do you know why this is and whether it can be achieved easily? Maybe I'm just missing something, but I looked through the source and couldn't find any extra arguments or config options.

Here's a minimal example:

import torch
import torch.nn as nn
import torch.nn.functional as F

from tinynn.converter import TFLiteConverter
from tinynn.graph.quantization.quantizer import PostQuantizer
from tinynn.util.train_util import DLContext
from tinynn.util.cifar10 import calibrate
from tinynn.graph.tracer import model_tracer

import torchvision

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(28*28, 10)

    def forward(self, x):
        return F.log_softmax(self.fc(torch.reshape(x, (-1, 28*28))), dim=-1)

if __name__ == '__main__':
    model = Net()
    tflite_path='ptq/model.tflite'

    dataloader = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST('./data/', 
            train=False, download=True,
            transform=torchvision.transforms.Compose([
                torchvision.transforms.ToTensor(),
            ])),
        batch_size=1
    )

    # Grab one MNIST batch to use as the example input for tracing
    _, inp = next(enumerate(dataloader))
    dummy_input = inp[0]

    with model_tracer():
        # Trace the model and prepare it for post-training quantization
        quantizer = PostQuantizer(model, dummy_input,
            work_dir='ptq', config={'asymmetric': True, 'per_tensor': False})
        ptq_model = quantizer.quantize()

    # Calibrate the observers with a subset of the validation data
    context = DLContext()
    context.train_loader = context.val_loader = dataloader
    context.max_iteration = 100
    calibrate(ptq_model, context)

    with torch.no_grad():
        ptq_model.eval()
        ptq_model.cpu()
        # Convert the calibrated model to a quantized one and export it to TFLite
        ptq_model = torch.quantization.convert(ptq_model)
        torch.backends.quantized.engine = quantizer.backend
        converter = TFLiteConverter(ptq_model, dummy_input, tflite_path=tflite_path,
                                    tflite_micro_rewrite=True,
                                    fuse_quant_dequant=True,
                                    rewrite_quantizable=True,
                                    quantize_target_type='int8')
        converter.convert()
peterjc123 commented 2 years ago

Please update to the latest commit and apply the following configuration to the quantizer.

quantizer = PostQuantizer(model, dummy_input, work_dir='ptq',
                          config={'asymmetric': True, 'per_tensor': False,
                                  'set_quantizable_op_stats': True})
peterjc123 commented 2 years ago

Do you know why this is and if it can be achieved easily?

Because quantization for log_softmax is not supported in PyTorch. That is why we implemented the rewrite_quantizable pass in the TFLiteConverter, which rewrites those floating-point kernels into quantized kernels. For log_softmax, it has to be used together with set_quantizable_op_stats in the PostQuantizer. Even with that option it is not enough here, because log_softmax is the last operation in your computation graph. While rewriting the model, the graph becomes ...-log_softmax-dequantize-quantize, and the graph optimizer then removes the consecutive dequantize and quantize nodes, which makes it impossible for the TFLiteConverter to restore the quantized kernel because the q-params are lost. So I pushed a new commit to fix this issue.
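
For anyone landing here later, the two options fit together roughly like this (a sketch based on the snippet earlier in this issue, not the exact code from the new commit): set_quantizable_op_stats makes the PostQuantizer record q-params for ops that PyTorch cannot quantize, and rewrite_quantizable lets the TFLiteConverter rewrite the remaining floating-point LOG_SOFTMAX kernel into a quantized one.

# Sketch: collect q-params for log_softmax during calibration ...
quantizer = PostQuantizer(model, dummy_input, work_dir='ptq',
                          config={'asymmetric': True, 'per_tensor': False,
                                  'set_quantizable_op_stats': True})
ptq_model = quantizer.quantize()

# ... calibrate and torch.quantization.convert(ptq_model) as in the example above ...

# ... then let the converter rewrite the floating-point kernel to a quantized one
converter = TFLiteConverter(ptq_model, dummy_input, tflite_path=tflite_path,
                            fuse_quant_dequant=True,
                            rewrite_quantizable=True,
                            quantize_target_type='int8')
converter.convert()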

travisjayday commented 2 years ago

Hi Peter! Thank you for your very quick response and commit. The new commit does seem to fix the quantization issue in the graph!!

However, I found some weird behavior which might be a TensorFlow Lite issue rather than a TinyNeuralNetwork issue.

If we use

interpreter = tf.lite.Interpreter(model_path=tflite, 
    experimental_op_resolver_type=tf.lite.experimental.OpResolverType.BUILTIN_REF)

The output of the softmax is always -128.

In any case, using the default op_resolver_type works as expected now! Once the model gets deployed to the MCU, I'll come back and comment on whether it worked. For now, I think this issue can be closed.
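
One quick way to confirm that the exported model now produces an INT8 output is to inspect the output tensor details (a minimal sketch; the model path is the one from the example above):

import tensorflow as tf

# With the fix, the output should be int8 with a (scale, zero_point) pair
# instead of float32 with empty q-params.
interpreter = tf.lite.Interpreter(model_path='ptq/model.tflite')
interpreter.allocate_tensors()
out = interpreter.get_output_details()[0]
print(out['dtype'], out['quantization'])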

Thanks again!!

peterjc123 commented 2 years ago

The output of the softmax is always -128.

You can use

import numpy as np
import tensorflow as tf

# Same arguments as in the snippet above (model path and the BUILTIN_REF resolver),
# plus experimental_preserve_all_tensors=True so every intermediate tensor can be read back
interpreter = tf.lite.Interpreter(
    model_path=tflite,
    experimental_op_resolver_type=tf.lite.experimental.OpResolverType.BUILTIN_REF,
    experimental_preserve_all_tensors=True)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
tensor_details = interpreter.get_tensor_details()

# Dummy input; adjust the shape and dtype to match your model's input
dummy_input = np.random.random(size=(1, 224, 224, 3)).astype('float32')

interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()

# Dump every intermediate tensor to see where the results diverge
for i in range(len(tensor_details)):
    print(i, tensor_details[i]['name'], interpreter.get_tensor(tensor_details[i]['index']))

to find out which layer is not implemented correctly and report the issue to TensorFlow Lite.
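
Since the preserved tensors are int8, it may also help to dequantize them with their recorded q-params before comparing against the float PyTorch outputs (a sketch, assuming per-tensor quantization, i.e. a single scale and zero point per tensor):

# Dequantize one intermediate tensor so it can be compared with the float model
detail = tensor_details[i]
scale, zero_point = detail['quantization']
int8_values = interpreter.get_tensor(detail['index'])
float_values = (int8_values.astype('float32') - zero_point) * scale
print(detail['name'], float_values)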