facebookresearch / kill-the-bits

Code for: "And the bit goes down: Revisiting the quantization of neural networks"

processing time #23

Closed rafikg closed 5 years ago

rafikg commented 5 years ago

Here you reach a very good compression rate, but what about the inference time? I think it will be much slower than before: for every forward step, we have to load the weights from the lookup table based on the indices and then run the conv operations. I know that you can, for example, load the model onto a mobile device, decompress it, and then start using it for inference (I have read this in another paper...). Can you clarify?

pierrestock commented 5 years ago

Hey DeeperDeeper,

Very interesting and crucial question! Quantization techniques are dependent on the hardware on which they will be implemented. In other words, there are two metrics to consider:

- the model size (memory footprint on disk);
- the inference time (and the runtime memory footprint).

Optimizing only the first metric (model size) still allows for easy transmission of the model, as you mention, for instance when updating the OS of a car or an app on a smartphone. In the paper, we argue that our method can give competitive inference times with some engineering effort, as explained below.

The code in inference.py should only be regarded as a proof of concept: we did not optimize this part of the code and the implementation is naive. That is, we recover the compressed accuracies reported in the paper from the centroids and the assignments. Since we uncompress the layers one by one on the fly, the inference time is slower than that of the non-compressed network (and the memory footprint is that of a non-compressed layer, thus quite large).
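For concreteness, here is a minimal sketch of what such a naive forward pass looks like for one quantized linear layer, using the column-block setting described below (W of size p x m, each column replaced by one of 256 centroids). The names `centroids`, `assignments` and `naive_forward` are illustrative, not the repository's actual API.

```python
import torch

def naive_forward(x, centroids, assignments):
    # x: (b, p) activations.
    # centroids: (256, p) codebook; assignments: (m,) index of the
    # centroid assigned to each column of the weight matrix W (p x m).
    # Rebuild the full weight matrix, then do a standard matmul:
    # correct, but the memory footprint and speed are those of the
    # uncompressed layer.
    W = centroids[assignments.long()].t()   # (p, m), fully instantiated
    return x @ W                            # (b, m)
```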

However, there is a more clever way to perform fast inference based on the centroids and the assignments (again, we specify the method but did not implement it, as it requires more engineering effort and depends on the hardware). We illustrate the idea in the case where the block size equals the column size: say W is a p x m matrix, quantized with m column blocks of size p. The idea is to chunk the input activations x of size b x p into b row blocks of size p and to compute in parallel the scalar products between the activation blocks and the weight blocks. Since we use byte-aligned codebooks (256 centroids = indexing on 1 byte), this should be fast. The case where the block size is smaller is similar but involves a summation step after computing all the scalar products. This way, we never instantiate the full uncompressed matrix as in the naive implementation.
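To make the idea concrete, here is a rough sketch (under the same column-block assumption as above, not the authors' optimized kernel): the scalar products between the activations and all 256 centroids are computed once, and each output column is then recovered by a byte-indexed lookup, so the full weight matrix never needs to be materialized.

```python
import torch

def fast_quantized_matmul(x, centroids, assignments):
    # x: (b, p) activations; centroids: (256, p); assignments: (m,).
    # Scalar products between the activations and every centroid,
    # computed once for the whole layer:
    scores = x @ centroids.t()                          # (b, 256)
    # Output column j is simply scores[:, assignments[j]], so the full
    # p x m weight matrix is never instantiated.
    return scores.index_select(1, assignments.long())   # (b, m)
```

When the block size is smaller than p, the same lookup is applied to each row block of x and the partial results are summed, as described above.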

Hope this clarification helps; please reach out for other questions!

Pierre