facebookresearch / kill-the-bits

Code for: "And the bit goes down: Revisiting the quantization of neural networks"

processing time #23

Closed rafikg closed 5 years ago

rafikg commented 5 years ago

Here you reach a very good compression rate, but what about the inference time? I think it will be much slower than before: for every forward step, we have to load the weights from the lookup table based on the indices and then run the conv operations. I know that you can, for example, load the model onto a mobile device, decompress it, and then start using it for inference (I have read this in another paper...). Can you clarify?

pierrestock commented 5 years ago

Hey DeeperDeeper,

Very interesting and crucial question! Quantization techniques are dependent on the hardware on which they will be implemented. In other words, there are two metrics to consider:

- the model size (memory footprint on disk);
- the inference time (and the runtime memory footprint).

Optimizing only the first metric (model size) still allows for easy transmission of the model, as you mention, for instance when updating the OS of a car or an app on a smartphone. In the paper, we argue that our method can give competitive inference times with some engineering effort, as explained below.

The code in inference.py should only be regarded as a proof of concept: we did not optimize this part of the code and the implementation is naive. That is, we recover the compressed accuracies reported in the paper from the centroids and the assignments. Since we uncompress the layers one by one on the fly, the inference time is slower than that of the non-compressed network (and the memory footprint is that of a non-compressed layer, thus quite large).
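For concreteness, here is a minimal sketch of what such a naive forward pass looks like for one quantized linear layer, using the column-block setting described below (W of size p x m, each column replaced by one of 256 centroids). The names `centroids`, `assignments` and `naive_forward` are illustrative, not the repository's actual API.

```python
import torch

def naive_forward(x, centroids, assignments):
    # x: (b, p) activations.
    # centroids: (256, p) codebook; assignments: (m,) index of the
    # centroid assigned to each column of the weight matrix W (p x m).
    # Rebuild the full weight matrix, then do a standard matmul:
    # correct, but the memory footprint and speed are those of the
    # uncompressed layer.
    W = centroids[assignments.long()].t()   # (p, m), fully instantiated
    return x @ W                            # (b, m)
```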

However, there is a more clever way to perform fast inference based on the centroids and the assignments (again, we specify the method but did not implement it, as it requires more engineering effort and depends on the hardware). We illustrate the idea in the case where the block size equals the column size: say W is a p x m matrix, quantized with m column blocks of size p. The idea is to chunk the input activations x of size b x p into b row blocks of size p and to compute in parallel the scalar products between the activation blocks and the weight blocks. Since we use byte-aligned codebooks (256 centroids = indexing on 1 byte), this should be fast. The case where the block size is smaller is similar but involves a summation step after computing all the scalar products. This way, we never instantiate the full uncompressed matrix as in the naive implementation.
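To make the idea concrete, here is a rough sketch (under the same column-block assumption as above, not the authors' optimized kernel): the scalar products between the activations and all 256 centroids are computed once, and each output column is then recovered by a byte-indexed lookup, so the full weight matrix never needs to be materialized.

```python
import torch

def fast_quantized_matmul(x, centroids, assignments):
    # x: (b, p) activations; centroids: (256, p); assignments: (m,).
    # Scalar products between the activations and every centroid,
    # computed once for the whole layer:
    scores = x @ centroids.t()                          # (b, 256)
    # Output column j is simply scores[:, assignments[j]], so the full
    # p x m weight matrix is never instantiated.
    return scores.index_select(1, assignments.long())   # (b, m)
```

When the block size is smaller than p, the same lookup is applied to each row block of x and the partial results are summed, as described above.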

Hope this clarification helps; please reach out for other questions!

Pierre