OscarSavolainenDR / Quantization-Tutorials

A bunch of coding tutorials for my YouTube videos on Neural Network Quantization.

Question about PTQ fake quantized FP32 model to INT8 #12

Open fabriziojpiva opened 5 months ago

fabriziojpiva commented 5 months ago

Hi Oscar,

First of all, many thanks for your tutorials, they are incredibly useful for learning quantization and getting hands-on experience with it!

I have the following situation and perhaps you could enlighten me a bit, since I cannot seem to find a connection between your tutorials and what I want to achieve.

My goal is to quantize a ViT to INT8, any variant (for example, DeiT-tiny would be more than enough). To avoid boilerplate and save training time, I have found methods that do PTQ on most ViT versions. An example of these methods is FQ-ViT, which performs PTQ on ViTs and yields a quantized model in INT8 or INT4 (but fake quantized to FP32).

The problem comes when I want to continue the path from the fake quantized model to an actual INT8 model. Since the method already yields a fake quantized model in FP32, do I need to just cast all weights and activation values to INT8? Or do I need to modify the code, add the quantization stubs, re-run the PTQ calibration to build the qconfigs and then save the resulting model?

I would be very happy to hear how you would approach this problem, as there are very few resources on the internet regarding quantizing ViTs.

Best,

Fabrizio

OscarSavolainenDR commented 5 months ago

Hi @fabriziojpiva!

Yeah that sounds a bit tricky. First I'll say that PyTorch's native support for "true" quantization for less than 8 bits is a bit weak, e.g. for int4: https://github.com/pytorch/pytorch/issues/74627. However, the support for int8 should be there. Typically I call torch.ao.quantization.convert which returns a "true" quantized model, and you may need to feed in some custom true quantized modules in the mapping.
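
For reference, here's a minimal sketch of what that eager-mode flow looks like end to end; `model` and `calibration_loader` are placeholders, and the mapping line is only needed if you have custom modules to swap in:

import torch
import torch.ao.quantization as tq

# Placeholders: `model` already has qconfigs attached, `calibration_loader` yields sample inputs.
model.eval()
prepared = tq.prepare(model)              # insert observers
with torch.no_grad():
    for images, _ in calibration_loader:  # PTQ calibration pass
        prepared(images)

# Optionally extend the default mapping with custom "true" quantized modules.
mapping = tq.get_default_static_quant_module_mappings()
model_int8 = tq.convert(prepared, mapping=mapping)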

What hardware are you targeting? E.g. NVIDIA, Intel chips, etc.?

There are a number of ways I imagine this might be done for the harder case of int4.

fabriziojpiva commented 5 months ago

Hi @OscarSavolainenDR!

Many thanks for the quick answer, it is very informative! Let me start with your questions so that I can help you better understand the goal. The hardware I am trying to target is NXP's i.MX 8M Plus embedded device.

Usually, the procedure to deploy a NN on this device involves the following chain: PyTorch -----> ONNX -----> TFLite -----> deployment in TFLite. To go from ONNX to TFLite we have our own conversion tool, so you can leave that step out of the equation.

The problem I am currently facing is the first transition, i.e. PyTorch -----> ONNX. In other words, how to take an INT8 fake quantized model produced by FQ-ViT (or any other PTQ method for ViTs, it does not have to be FQ-ViT specifically), which in reality is in FP32, and convert it to ONNX with INT8 representations instead of FP32.

For now, I am focusing on INT8 quantization, so there is no need to overcomplicate things with INT4. Being a beginner in quantization, I have learnt from your YouTube videos and nice Medium post that there is a whole procedure to follow. But if I start from a model that has already been through PTQ, I assume there are a lot of steps I can skip. From what I understood from your tutorials, I would still need to (please correct me if I am wrong):

1) Attach qconfigs before running FQ-ViT:

import torch
import torch.ao.quantization as tq
from torch.ao.quantization._learnable_fake_quantize import _LearnableFakeQuantize

activation_qconfig = _LearnableFakeQuantize.with_args(
    observer=tq.MinMaxObserver,       # PTQ observer, FQ-ViT uses minmax
    quant_min=0,
    quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,  # We specify we want per-tensor affine quantization
    scale=0.1,                        # Initial qparam scale
    zero_point=0.0,                   # Initial qparam zero-point
    use_grad_scaling=True,
)
module = deit_tiny_model()            # placeholder for building the DeiT-tiny model
module.qconfig = tq.QConfig(
    activation=activation_qconfig,
    weight=tq.default_observer.with_args(dtype=torch.qint8),
)

2) Run PTQ, then place the model in eval() mode and convert it to int8:

# ...run PTQ calibration loop, obtaining model_fp32_fakeq...
model_fp32_fakeq.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_fakeq)

3) If some quantized modules do not exist, like QIntLayerNorm, what alternative do I have?

4) Export to ONNX:

torch_input = torch.randn(1, 3, 224, 224, device='cuda')
torch.onnx.export(model_int8, torch_input, "model_int8.onnx",
                  verbose=True,
                  export_params=True,
                  opset_version=13,
                  do_constant_folding=True,
                  input_names=["image"],
                  output_names=["output"],
                  dynamic_axes={"image": {0: "batch_size"},
                                "output": {0: "batch_size"}})

Thanks again for your help, I feel a bit less lost now haha.

OscarSavolainenDR commented 5 months ago

On:

activation_qconfig = _LearnableFakeQuantize.with_args(
    observer=tq.MinMaxObserver,       # PTQ observer, FQ-ViT uses minmax
    quant_min=0,
    quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,  # We specify we want per-tensor affine quantization
    scale=0.1,                        # Initial qparam scale
    zero_point=0.0,                   # Initial qparam zero-point
    use_grad_scaling=True,
)
module = deit_tiny_model()
module.qconfig = tq.QConfig(
    activation=activation_qconfig,
    weight=tq.default_observer.with_args(dtype=torch.qint8),
)

So typically you'd want to assign the qconfigs on a per-module basis, not just to the whole model. You can assign a qconfig to the whole model and it will propagate down (every submodule of a module with a qconfig, if it doesn't have its own qconfig, is assigned its parent's qconfig), but you generally want fine-grained control over how you do it, and so assign qconfigs more or less individually to each layer. So e.g.

activation_qconfig = _LearnableFakeQuantize.with_args(
    observer=tq.MinMaxObserver,       # PTQ observer, FQ-ViT uses minmax
    quant_min=0,
    quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,  # We specify we want per-tensor affine quantization
    scale=0.1,                        # Initial qparam scale
    zero_point=0.0,                   # Initial qparam zero-point
    use_grad_scaling=True,
)
model = deit_tiny_model()
for module in model.modules():
    # This assignment can be customised for each module type, by name, etc.
    # I've also never used tq.default_observer in production before, not sure how it will play;
    # I normally use an explicit LearnableFakeQuantize for the weights as well,
    # e.g. [here](https://github.com/OscarSavolainenDR/Quantization-Tutorials/blob/ac69533b40dd46888caceebf7ee42441efe2ada9/Resnet-FX-QAT/main.py#L37C1-L37C67).
    module.qconfig = tq.QConfig(
        activation=activation_qconfig,
        weight=tq.default_observer.with_args(dtype=torch.qint8),
    )

Generally, the flow might be simpler than that, as you can go from the fake-quant model to ONNX directly (i.e. from the start of step 2 straight to step 4). It's not strictly necessary to go via the PyTorch conversion backend.
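
I.e. something along the lines of the export call in your step 4, just fed the fake-quant model instead of the converted one (a sketch; whether all the fake-quant modules map cleanly to ONNX quantize/dequantize nodes will depend on the modules used):

# Sketch: export the fake-quantized model directly, skipping torch.ao.quantization.convert.
model_fp32_fakeq.eval()
torch.onnx.export(model_fp32_fakeq, torch.randn(1, 3, 224, 224), "model_fakequant.onnx",
                  opset_version=13, input_names=["image"], output_names=["output"])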

However, I don't know whether you'll hit issues with the custom layers in ONNX; I haven't had that problem before, as I've always used native PyTorch layers that I was able to convert one way or another. If you do have issues, you may want to go via the conversion route first, as in your listed plan.

To swap out QIntLayerNorm in conversion, you might want to check out the PyTorch "true" int8 LayerNorm module. I haven't used that layer explicitly before, so I'm not sure if it's exactly what you want. But I'd read through the QIntLayerNorm and the PyTorch converted LayerNorm modules and see if their forward calls are a match. If so, you can add the converted LayerNorm as the target in your mapping, e.g.

from torch.ao.quantization.quantization_mappings import get_default_static_quant_module_mappings

mapping = get_default_static_quant_module_mappings()
mapping[QIntLayerNorm] = torch.ao.nn.quantized.modules.LayerNorm
converted_model = torch.ao.quantization.convert(
    fake_quant_model, inplace=False, mapping=mapping
)

Since QIntLayerNorm isn't a native PyTorch module, you may have to do a bit of hacking to get it to convert properly; e.g. I think the from_reference method may come in quite useful. If using the mapping in conversion doesn't quite work, you can also swap the module directly into your converted model: first produce the converted layer (e.g. via the from_reference method), then manually insert it into the converted model.
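
As a rough illustration of that manual swap, here is a minimal sketch. The constructor arguments come from PyTorch's torch.ao.nn.quantized.LayerNorm, but the attribute names on the FQ-ViT side (and the idea of taking the output scale/zero-point from the QAct that follows the norm) are assumptions about FQ-ViT's internals, so double-check them against the actual code:

import torch
import torch.ao.nn.quantized as nnq

# Hypothetical helper: build a "true" int8 LayerNorm from an FQ-ViT QIntLayerNorm,
# given the output scale/zero-point calibrated by the QAct that follows it.
def to_quantized_layernorm(fq_norm, out_scale, out_zero_point):
    return nnq.LayerNorm(
        fq_norm.normalized_shape,
        fq_norm.weight,
        fq_norm.bias,
        scale=float(out_scale),
        zero_point=int(out_zero_point),
        eps=fq_norm.eps,
    )

# Manual insertion into the converted model (block 0 as an example):
# converted_model.blocks[0].norm1 = to_quantized_layernorm(
#     fake_quant_model.blocks[0].norm1, scale_from_qact1, zp_from_qact1
# )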

Hopefully that helps!

fabriziojpiva commented 4 months ago

Hi Oscar,

I wanted to make sure that I experimented enough before replying to your suggestions, which are quite valuable.

Let me remind you that FQ-ViT performs PTQ, yielding a fully (fake) quantized model in INT8. I have been able to convert this model from PyTorch to ONNX, and then from ONNX to TFLite. Because the PyTorch model is fake quantized, the resulting TFLite model is still in FP32 format, which cannot be deployed on the target hardware as it supports only INT8. I should add that I ran evaluation on the model after every conversion and everything went fine, so the TFLite (FP32) model runs and works.

I tried to solve this by running PTQ again in TFLite, but the model gets completely destroyed, achieving 0% accuracy. At this point I don't really know how to go further. Do you think it is possible to convert the PyTorch model's tensors to native INT8 tensors (instead of FP32) after running PTQ in PyTorch and just before saving the .pth model? This way I would get a non-fake quantized model in PyTorch, and when I then go to ONNX and TFLite the entire graph operates on integers.

If you have any other (better) suggestions, I would really appreciate it hearing from you.

Thanks for your time!

OscarSavolainenDR commented 4 months ago

Yep, so you can convert the model to INT8 inside of PyTorch via the conversion API I mentioned earlier. There's an example here. You may want to experiment with the quantization backend as well, depending on what hardware you're targeting (from this webpage): [screenshot of the backend options from the linked page]
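
As a minimal sketch of what switching the backend looks like (the 'qnnpack' choice below is just an example of an ARM-oriented backend, not a specific recommendation for the i.MX 8M Plus):

import torch
import torch.ao.quantization as tq

# Pick the quantized kernel backend: 'fbgemm'/'x86' target x86 CPUs, 'qnnpack' targets ARM.
torch.backends.quantized.engine = "qnnpack"

model.eval()  # `model` is a placeholder for your FP32 model
model.qconfig = tq.get_default_qconfig("qnnpack")  # observers matched to the backend
prepared = tq.prepare(model)
# ...run the calibration loop on `prepared`...
model_int8 = tq.convert(prepared)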

You could also probably do this inside of ONNX Runtime. I haven't done that personally, but looking at the code it seems like the quantizer.quantize_model() API should do the job, though it might be a bit fiddly to format everything.
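
For what it's worth, the higher-level entry point there is onnxruntime.quantization.quantize_static, which I believe wraps that quantizer under the hood. A minimal sketch, assuming you can wrap some calibration batches in a CalibrationDataReader (the paths, class name and `batches` variable below are placeholders):

from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class ImageCalibrationReader(CalibrationDataReader):
    # Feeds calibration batches to the quantizer; each element is {"image": np.ndarray of shape (1, 3, 224, 224)}.
    def __init__(self, batches):
        self._iter = iter(batches)

    def get_next(self):
        return next(self._iter, None)

quantize_static(
    "model_fp32.onnx",                # placeholder: input ONNX model path
    "model_int8.onnx",                # placeholder: output path
    ImageCalibrationReader(batches),  # `batches` assumed to be prepared elsewhere
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QUInt8,
)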

fabriziojpiva commented 4 months ago

Hi Oscar,

Thanks for the quick and concise answer. Unfortunately it seems like torch.ao.quantization.convert must receive a model with the standard PyTorch qconfigs. The reason I say this is that the function runs perfectly without errors, but when I try to use the onnx2tf tool to convert the (supposedly INT8) model to TFLite, the log looks like this:

Model conversion started ============================================================
INFO: input_op_name: image shape: [1, 3, 224, 224] dtype: float32

INFO: 2 / 2593
INFO: onnx_op_type: Div onnx_op_name: Div_1
INFO:  input_name.1: image shape: [1, 3, 224, 224] dtype: float32
INFO:  input_name.2: 154 shape: [1, 1, 1, 1] dtype: float32
INFO:  output_name.1: 155 shape: [1, 3, 224, 224] dtype: float32
INFO: tf_op_type: divide
INFO:  input.1.x: name: image shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.divide/truediv:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>

INFO: 3 / 2593
INFO: onnx_op_type: Add onnx_op_name: Add_3
INFO:  input_name.1: 155 shape: [1, 3, 224, 224] dtype: float32
INFO:  input_name.2: 156 shape: [1, 1, 1, 1] dtype: float32
INFO:  output_name.1: 157 shape: [1, 3, 224, 224] dtype: float32
INFO: tf_op_type: add
INFO:  input.1.x: name: tf.math.divide/truediv:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.add_1/Add:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>

INFO: 4 / 2593
INFO: onnx_op_type: Round onnx_op_name: Round_4
INFO:  input_name.1: 157 shape: [1, 3, 224, 224] dtype: float32
INFO:  output_name.1: 158 shape: [1, 3, 224, 224] dtype: float32
INFO: tf_op_type: round
INFO:  input.1.x: name: tf.math.add_1/Add:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>
INFO:  output.1.output: name: tf.math.round/Round:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>

INFO: 5 / 2593
INFO: onnx_op_type: Clip onnx_op_name: Clip_5
INFO:  input_name.1: 158 shape: [1, 3, 224, 224] dtype: float32
INFO:  input_name.2: 6513 shape: [] dtype: float32
INFO:  input_name.3: 6514 shape: [] dtype: float32
INFO:  output_name.1: 163 shape: [1, 3, 224, 224] dtype: float32
INFO: tf_op_type: clip_by_value
INFO:  input.1.features: name: tf.math.round/Round:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>
INFO:  input.2.min_value: shape: () dtype: float32
INFO:  input.3.max_value: shape: () dtype: float32
INFO:  output.1.output: name: tf.clip_by_value/clip_by_value:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>

INFO: 6 / 2593
INFO: onnx_op_type: Sub onnx_op_name: Sub_7
INFO:  input_name.1: 163 shape: [1, 3, 224, 224] dtype: float32
INFO:  input_name.2: 156 shape: (1, 1, 1, 1) dtype: float32
INFO:  output_name.1: 165 shape: [1, 3, 224, 224] dtype: float32
INFO: tf_op_type: subtract
INFO:  input.1.x: name: tf.clip_by_value/clip_by_value:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.subtract_1/Sub:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>

INFO: 7 / 2593
INFO: onnx_op_type: Mul onnx_op_name: Mul_9
INFO:  input_name.1: 165 shape: [1, 3, 224, 224] dtype: float32
INFO:  input_name.2: 154 shape: (1, 1, 1, 1) dtype: float32
INFO:  output_name.1: 167 shape: [1, 3, 224, 224] dtype: float32
INFO: tf_op_type: multiply
INFO:  input.1.x: name: tf.math.subtract_1/Sub:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.multiply_5/Mul:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>

INFO: 8 / 2593
INFO: onnx_op_type: Conv onnx_op_name: Conv_16
INFO:  input_name.1: 167 shape: [1, 3, 224, 224] dtype: float32
INFO:  input_name.2: 181 shape: [192, 3, 16, 16] dtype: float32
INFO:  input_name.3: patch_embed.proj.bias shape: [192] dtype: float32
INFO:  output_name.1: 182 shape: [1, 192, 14, 14] dtype: float32
INFO: tf_op_type: convolution_v2
INFO:  input.1.input: name: tf.math.multiply_5/Mul:0 shape: (1, 224, 224, 3) dtype: <dtype: 'float32'>
INFO:  input.2.weights: shape: (16, 16, 3, 192) dtype: <dtype: 'float32'>
INFO:  input.3.bias: shape: (192,) dtype: <dtype: 'float32'>
INFO:  input.4.strides: val: [16, 16]
INFO:  input.5.dilations: val: [1, 1]
INFO:  input.6.padding: val: VALID
INFO:  input.7.group: val: 1
INFO:  output.1.output: name: tf.math.add_2/Add:0 shape: (1, 14, 14, 192) dtype: <dtype: 'float32'>

INFO: 9 / 2593
INFO: onnx_op_type: Reshape onnx_op_name: Reshape_24
INFO:  input_name.1: 182 shape: [1, 192, 14, 14] dtype: float32
INFO:  input_name.2: 6450 shape: [3] dtype: int64
INFO:  output_name.1: 190 shape: [1, 192, 196] dtype: float32
INFO: tf_op_type: reshape
INFO:  input.1.tensor: name: tf.compat.v1.transpose_2/transpose:0 shape: (1, 192, 14, 14) dtype: <dtype: 'float32'>
INFO:  input.2.shape: val: [1, 192, -1]
INFO:  output.1.output: name: tf.reshape_2/Reshape:0 shape: (1, 192, 196) dtype: <dtype: 'float32'>

INFO: 10 / 2593
INFO: onnx_op_type: Transpose onnx_op_name: Transpose_25
INFO:  input_name.1: 190 shape: [1, 192, 196] dtype: float32
INFO:  output_name.1: 191 shape: [1, 196, 192] dtype: float32
INFO: tf_op_type: transpose_v2
INFO:  input.1.a: name: tf.reshape_2/Reshape:0 shape: (1, 192, 196) dtype: <dtype: 'float32'>
INFO:  input.2.perm: val: [0, 2, 1]
INFO:  output.1.output: name: tf.compat.v1.transpose_5/transpose:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>

INFO: 11 / 2593
INFO: onnx_op_type: Div onnx_op_name: Div_27
INFO:  input_name.1: 191 shape: [1, 196, 192] dtype: float32
INFO:  input_name.2: 192 shape: [1, 1, 1] dtype: float32
INFO:  output_name.1: 193 shape: [1, 196, 192] dtype: float32
INFO: tf_op_type: divide
INFO:  input.1.x: name: tf.compat.v1.transpose_5/transpose:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.divide_1/truediv:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>

INFO: 12 / 2593
INFO: onnx_op_type: Add onnx_op_name: Add_29
INFO:  input_name.1: 193 shape: [1, 196, 192] dtype: float32
INFO:  input_name.2: 194 shape: [1, 1, 1] dtype: float32
INFO:  output_name.1: 195 shape: [1, 196, 192] dtype: float32
INFO: tf_op_type: add
INFO:  input.1.x: name: tf.math.divide_1/truediv:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.add_3/Add:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>

INFO: 13 / 2593
INFO: onnx_op_type: Round onnx_op_name: Round_30
INFO:  input_name.1: 195 shape: [1, 196, 192] dtype: float32
INFO:  output_name.1: 196 shape: [1, 196, 192] dtype: float32
INFO: tf_op_type: round
INFO:  input.1.x: name: tf.math.add_3/Add:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>
INFO:  output.1.output: name: tf.math.round_1/Round:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>

INFO: 14 / 2593
INFO: onnx_op_type: Clip onnx_op_name: Clip_31
INFO:  input_name.1: 196 shape: [1, 196, 192] dtype: float32
INFO:  input_name.2: 6513 shape: () dtype: float32
INFO:  input_name.3: 6514 shape: () dtype: float32
INFO:  output_name.1: 201 shape: [1, 196, 192] dtype: float32
INFO: tf_op_type: clip_by_value
INFO:  input.1.features: name: tf.math.round_1/Round:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>
INFO:  input.2.min_value: shape: () dtype: float32
INFO:  input.3.max_value: shape: () dtype: float32
INFO:  output.1.output: name: tf.clip_by_value_1/clip_by_value:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>

INFO: 15 / 2593
INFO: onnx_op_type: Sub onnx_op_name: Sub_33
INFO:  input_name.1: 201 shape: [1, 196, 192] dtype: float32
INFO:  input_name.2: 194 shape: (1, 1, 1) dtype: float32
INFO:  output_name.1: 203 shape: [1, 196, 192] dtype: float32
INFO: tf_op_type: subtract
INFO:  input.1.x: name: tf.clip_by_value_1/clip_by_value:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.subtract_2/Sub:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>

INFO: 16 / 2593
INFO: onnx_op_type: Mul onnx_op_name: Mul_35
INFO:  input_name.1: 203 shape: [1, 196, 192] dtype: float32
INFO:  input_name.2: 192 shape: (1, 1, 1) dtype: float32
INFO:  output_name.1: 205 shape: [1, 196, 192] dtype: float32
INFO: tf_op_type: multiply
INFO:  input.1.x: name: tf.math.subtract_2/Sub:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.multiply_14/Mul:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>

INFO: 17 / 2593
INFO: onnx_op_type: Concat onnx_op_name: Concat_42
INFO:  input_name.1: 223 shape: [1, 1, 192] dtype: float32
INFO:  input_name.2: 205 shape: [1, 196, 192] dtype: float32
INFO:  output_name.1: 224 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: concat
INFO:  input.1.input0: shape: (1, 1, 192) dtype: <dtype: 'float32'>
INFO:  input.2.input1: name: tf.math.multiply_14/Mul:0 shape: (1, 196, 192) dtype: <dtype: 'float32'>
INFO:  input.3.axis: val: 1
INFO:  output.1.output: name: tf.concat_2/concat:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 18 / 2593
INFO: onnx_op_type: Div onnx_op_name: Div_44
INFO:  input_name.1: 224 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 192 shape: (1, 1, 1) dtype: float32
INFO:  output_name.1: 226 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: divide
INFO:  input.1.x: name: tf.concat_2/concat:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.divide_2/truediv:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 19 / 2593
INFO: onnx_op_type: Add onnx_op_name: Add_46
INFO:  input_name.1: 226 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 194 shape: (1, 1, 1) dtype: float32
INFO:  output_name.1: 228 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: add
INFO:  input.1.x: name: tf.math.divide_2/truediv:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.add_4/Add:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 20 / 2593
INFO: onnx_op_type: Round onnx_op_name: Round_47
INFO:  input_name.1: 228 shape: [1, 197, 192] dtype: float32
INFO:  output_name.1: 229 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: round
INFO:  input.1.x: name: tf.math.add_4/Add:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  output.1.output: name: tf.math.round_2/Round:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 21 / 2593
INFO: onnx_op_type: Clip onnx_op_name: Clip_48
INFO:  input_name.1: 229 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 6513 shape: () dtype: float32
INFO:  input_name.3: 6514 shape: () dtype: float32
INFO:  output_name.1: 234 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: clip_by_value
INFO:  input.1.features: name: tf.math.round_2/Round:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.min_value: shape: () dtype: float32
INFO:  input.3.max_value: shape: () dtype: float32
INFO:  output.1.output: name: tf.clip_by_value_2/clip_by_value:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 22 / 2593
INFO: onnx_op_type: Sub onnx_op_name: Sub_50
INFO:  input_name.1: 234 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 194 shape: (1, 1, 1) dtype: float32
INFO:  output_name.1: 236 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: subtract
INFO:  input.1.x: name: tf.clip_by_value_2/clip_by_value:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.subtract_3/Sub:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 23 / 2593
INFO: onnx_op_type: Mul onnx_op_name: Mul_52
INFO:  input_name.1: 236 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 192 shape: (1, 1, 1) dtype: float32
INFO:  output_name.1: 238 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: multiply
INFO:  input.1.x: name: tf.math.subtract_3/Sub:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.multiply_23/Mul:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 24 / 2593
INFO: onnx_op_type: Add onnx_op_name: Add_59
INFO:  input_name.1: 238 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 252 shape: [1, 197, 192] dtype: float32
INFO:  output_name.1: 253 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: add
INFO:  input.1.x: name: tf.math.multiply_23/Mul:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 197, 192) dtype: float32
INFO:  output.1.output: name: tf.math.add_5/Add:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 25 / 2593
INFO: onnx_op_type: Div onnx_op_name: Div_61
INFO:  input_name.1: 253 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 254 shape: [1, 1, 192] dtype: float32
INFO:  output_name.1: 255 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: divide
INFO:  input.1.x: name: tf.math.add_5/Add:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 192) dtype: float32
INFO:  output.1.output: name: tf.math.divide_3/truediv:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 26 / 2593
INFO: onnx_op_type: Add onnx_op_name: Add_63
INFO:  input_name.1: 255 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 256 shape: [1, 1, 1] dtype: float32
INFO:  output_name.1: 257 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: add
INFO:  input.1.x: name: tf.math.divide_3/truediv:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.add_6/Add:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 27 / 2593
INFO: onnx_op_type: Round onnx_op_name: Round_64
INFO:  input_name.1: 257 shape: [1, 197, 192] dtype: float32
INFO:  output_name.1: 258 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: round
INFO:  input.1.x: name: tf.math.add_6/Add:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  output.1.output: name: tf.math.round_3/Round:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 28 / 2593
INFO: onnx_op_type: Clip onnx_op_name: Clip_65
INFO:  input_name.1: 258 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 6513 shape: () dtype: float32
INFO:  input_name.3: 6514 shape: () dtype: float32
INFO:  output_name.1: 263 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: clip_by_value
INFO:  input.1.features: name: tf.math.round_3/Round:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.min_value: shape: () dtype: float32
INFO:  input.3.max_value: shape: () dtype: float32
INFO:  output.1.output: name: tf.clip_by_value_3/clip_by_value:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 29 / 2593
INFO: onnx_op_type: Sub onnx_op_name: Sub_67
INFO:  input_name.1: 263 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 256 shape: (1, 1, 1) dtype: float32
INFO:  output_name.1: 265 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: subtract
INFO:  input.1.x: name: tf.clip_by_value_3/clip_by_value:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 1) dtype: float32
INFO:  output.1.output: name: tf.math.subtract_4/Sub:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 30 / 2593
INFO: onnx_op_type: Mul onnx_op_name: Mul_69
INFO:  input_name.1: 265 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 254 shape: (1, 1, 192) dtype: float32
INFO:  output_name.1: 267 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: multiply
INFO:  input.1.x: name: tf.math.subtract_4/Sub:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 192) dtype: float32
INFO:  output.1.output: name: tf.math.multiply_34/Mul:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

INFO: 31 / 2593
INFO: onnx_op_type: Div onnx_op_name: Div_71
INFO:  input_name.1: 267 shape: [1, 197, 192] dtype: float32
INFO:  input_name.2: 254 shape: (1, 1, 192) dtype: float32
INFO:  output_name.1: 269 shape: [1, 197, 192] dtype: float32
INFO: tf_op_type: divide
INFO:  input.1.x: name: tf.math.multiply_34/Mul:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>
INFO:  input.2.y: shape: (1, 1, 192) dtype: float32
INFO:  output.1.output: name: tf.math.divide_4/truediv:0 shape: (1, 197, 192) dtype: <dtype: 'float32'>

As you can see, most operations stayed in FP32, so it is clear that torch.ao.quantization.convert did not actually convert anything and simply returned the same FP32 model. Just in case, this is the result of printing the "INT8" model after running torch.ao.quantization.convert:

VisionTransformer(
  (qact_input): QAct(
    (quantizer): UniformQuantizer()
  )
  (patch_embed): PatchEmbed(
    (proj): QConv2d(
      3, 192, kernel_size=(16, 16), stride=(16, 16)
      (quantizer): UniformQuantizer()
    )
    (qact_before_norm): Identity()
    (norm): Identity()
    (qact): QAct(
      (quantizer): UniformQuantizer()
    )
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (qact_embed): QAct(
    (quantizer): UniformQuantizer()
  )
  (qact_pos): QAct(
    (quantizer): UniformQuantizer()
  )
  (qact1): QAct(
    (quantizer): UniformQuantizer()
  )
  (blocks): ModuleList(
    (0): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (1): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (2): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (3): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (4): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (5): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (6): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (7): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (8): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (9): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (10): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
    (11): Block(
      (norm1): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact1): QAct(
        (quantizer): UniformQuantizer()
      )
      (attn): Attention(
        (qkv): QLinear(
          in_features=192, out_features=576, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (proj): QLinear(
          in_features=192, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact3): QAct(
          (quantizer): UniformQuantizer()
        )
        (qact_attn1): QAct(
          (quantizer): UniformQuantizer()
        )
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (log_int_softmax): QIntSoftmax(
          (quantizer): Log2Quantizer()
        )
      )
      (drop_path): Identity()
      (qact2): QAct(
        (quantizer): UniformQuantizer()
      )
      (norm2): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
      (qact3): QAct(
        (quantizer): UniformQuantizer()
      )
      (mlp): Mlp(
        (fc1): QLinear(
          in_features=192, out_features=768, bias=True
          (quantizer): UniformQuantizer()
        )
        (act): GELU()
        (qact1): QAct(
          (quantizer): UniformQuantizer()
        )
        (fc2): QLinear(
          in_features=768, out_features=192, bias=True
          (quantizer): UniformQuantizer()
        )
        (qact2): QAct(
          (quantizer): UniformQuantizer()
        )
        (drop): Dropout(p=0.0, inplace=False)
      )
      (qact4): QAct(
        (quantizer): UniformQuantizer()
      )
    )
  )
  (norm): QIntLayerNorm((192,), eps=1e-06, elementwise_affine=True)
  (qact2): QAct(
    (quantizer): UniformQuantizer()
  )
  (pre_logits): Identity()
  (head): QLinear(
    in_features=192, out_features=1000, bias=True
    (quantizer): UniformQuantizer()
  )
  (act_out): QAct(
    (quantizer): UniformQuantizer()
  )
)

At this point, do you think that casting every tensor to INT8 after running PTQ in PyTorch is the only way to go? It seems to be a pain trying to deploy these models that are quantized in a custom way, i.e. without using PyTorch quantization tools.

OscarSavolainenDR commented 4 months ago

Yeah, that seems a bit tricky. Honestly, quantization can get a bit hacky when you push to the limits of the APIs like you're doing. I think the options are either converting the tensors one by one, or copying the qparams from the non-PyTorch-native fake-quantised modules into PyTorch-native ones and then converting with the PyTorch API. At this point all solutions will be a bit hacky, so I think the best you can do is just find something that works. If this were something you needed to do multiple times you would want to systematise it, but for a one-off thing, hacky is fine as long as it gives the desired result.
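
To make the qparam-copying option a bit more concrete, here is a minimal sketch. It assumes (hypothetically) that you have a PyTorch-native fake-quant-prepared copy of the same DeiT architecture (e.g. via tq.prepare_qat, so modules carry weight_fake_quant), that module names line up between the two models, and that FQ-ViT's layers expose their calibrated qparams as quantizer.scale / quantizer.zero_point; all of those attribute names would need checking against the real FQ-ViT code:

import torch
import torch.ao.quantization as tq

def copy_weight_qparams(fqvit_model, native_model):
    # Hypothetical: copy calibrated weight qparams from FQ-ViT modules into the
    # corresponding PyTorch-native fake-quant modules (matched by module name).
    fq_modules = dict(fqvit_model.named_modules())
    for name, mod in native_model.named_modules():
        fq_mod = fq_modules.get(name)
        if fq_mod is None or not hasattr(fq_mod, "quantizer"):
            continue
        if hasattr(mod, "weight_fake_quant"):
            mod.weight_fake_quant.scale.data.copy_(fq_mod.quantizer.scale.reshape(-1))
            mod.weight_fake_quant.zero_point.data.copy_(fq_mod.quantizer.zero_point.reshape(-1))

copy_weight_qparams(fqvit_model, native_model)
model_int8 = tq.convert(native_model.eval())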