CoffeeCat3008871 opened this issue 1 year ago
Hi, thank you for sharing your excellent work. In Table 6 of your paper, you show how simply applying HAQ affects the accuracy of several DeiT models compared to the baseline. However, in Appendix A of the supplementary material, you mention that HAQ is only used for the "fully-connect layers in vision transformer". Does that mean no quantization search was applied to the MatMul part of the self-attention layers? Also, could you clarify how you apply HAQ in your code? I saw that you use autocast(), but I am not sure how the search part is done with HAQ.

Thanks. For the first question: following the ViT implementation in timm, we quantize all linear layers and apply HAQ to the QKV projection and after the MatMul in multi-head self-attention. For the second question: we apply HAQ in our code to search for the optimal bitwidth of each linear layer, and then use the searched bitwidths for finetuning.
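As a rough illustration of the per-layer bitwidth search discussed above: HAQ proper uses a reinforcement-learning agent to pick bitwidths, but the sketch below (my own simplification, not the repo's code) replaces the agent with a simple error-threshold rule on top of symmetric uniform quantization, just to show what "search the optimal bitwidth for each linear layer" means mechanically. All function names here are hypothetical.

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Symmetric uniform quantization of a weight tensor to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale  # dequantized weights, same shape as w

def search_bitwidth(w, max_err=1e-2, candidates=(4, 6, 8)):
    # Toy stand-in for HAQ's RL-based search: take the smallest
    # candidate bitwidth whose reconstruction MSE stays under a
    # threshold; fall back to the largest candidate otherwise.
    for bits in candidates:
        err = np.mean((w - quantize_symmetric(w, bits)) ** 2)
        if err < max_err:
            return bits
    return candidates[-1]

# Per-layer search over a model's linear weights, e.g. the QKV
# projection and MLP layers of a ViT block (dummy tensors here).
rng = np.random.default_rng(0)
layers = {
    "qkv": rng.standard_normal((192, 64)),
    "mlp_fc1": rng.standard_normal((64, 256)),
}
bitwidths = {name: search_bitwidth(w) for name, w in layers.items()}
```

After the search, the per-layer `bitwidths` dictionary would be fixed and the model finetuned with those bitwidths applied, matching the two-stage flow described in the reply.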