Vahe1994 / SpQR


Reason for permutation and weights after it is inverted #18

Closed · NRodion closed this issue 1 year ago

NRodion commented 1 year ago
  1. Why is weight permutation used in the code, and is it mentioned in the paper?
  2. I looked at the layer weights after quantization, and they should follow a certain pattern: `torch.unique(layer.weight.data[idx, :blocksize])` should return at most 2^bit values for any `idx` and any quantization. This holds for the original GPTQ code, and it also holds in your code with the identity permutation, but for the other permutation options I consistently get `blocksize` unique values instead of 2^bit (a sketch of the check I'm running is below). Am I missing something? Outliers could contribute to the total number of unique values, but there cannot be `blocksize - 2^bit` of them (why quantize at all then?). Are you sure the weight matrix is reconstructed correctly?
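
For concreteness, here is a minimal sketch of the check described above; `layer`, `idx`, `blocksize`, and `bit` are placeholders, not SpQR's actual variable names:

```python
import torch

# Hypothetical sanity check: within one quantization block of a row,
# a b-bit quantizer should produce at most 2**b distinct values.
def count_unique_in_block(layer, idx, blocksize):
    row = layer.weight.data[idx]     # one output channel
    block = row[:blocksize]          # first block, original channel order
    return torch.unique(block).numel()

# For, e.g., 3-bit quantization one would expect
# count_unique_in_block(layer, idx, blocksize) <= 2**3 (plus any fp16 outliers).
```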
Godofnothing commented 1 year ago

Hi, @NRodion , thanks for your interest in the project!

1) The weight permutation determines the order in which the weights are quantized. Below is the relevant paragraph from our paper. `identity` means the original order of the feature dimensions, and `act_order` denotes reordering the weights in descending order according to the magnitude of the channels.

[screenshot: paragraph from the paper describing the quantization order]
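
As an illustration of the reordering idea (not the exact SpQR implementation), an `act_order`-style permutation can be derived from per-channel statistics, e.g. the diagonal of a GPTQ-style Hessian proxy; the function and variable names below are assumptions:

```python
import torch

# Illustrative only: build a channel permutation so that the "most important"
# input channels are quantized first, GPTQ act_order style.
# H is assumed to be an (in_features x in_features) Hessian proxy, e.g. X @ X.T.
def make_permutation(H, mode="act_order"):
    if mode == "identity":
        return torch.arange(H.shape[0])
    # descending order of per-channel magnitude (Hessian diagonal)
    return torch.argsort(torch.diag(H), descending=True)
```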

2) This behaviour comes from the fact that the weights are quantized in the permuted order, i.e. you get 2^bit unique values per group of `groupsize` weights only when they are viewed in that permuted order. Hence, you should find at most 2^bit quantized values after imposing the same channel order that was used during quantization, i.e. `torch.unique(layer.weight.data[idx, perm][:blocksize])`.
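
A minimal sketch of this permutation-aware check, assuming `perm` is the same channel permutation used at quantization time:

```python
import torch

# Hypothetical: re-apply the quantization-time permutation before slicing a block,
# so the first `blocksize` entries form one contiguous quantization group again.
def count_unique_in_permuted_block(layer, idx, perm, blocksize):
    permuted_row = layer.weight.data[idx, perm]   # row in quantization order
    block = permuted_row[:blocksize]
    return torch.unique(block).numel()            # expect <= 2**bit (+ outliers)
```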

NRodion commented 1 year ago

Hi and thank you for the reply.

  1. I suspected that was the reason; I should pay more attention to the small details, I guess. Permutation also looks more beneficial for models with a lower number of parameters: I was using SpQR with BLOOM models, and the perplexity gains between identity and act_order were noticeable.
  2. Doesn't that mean you also need to save the permutation order in addition to everything described in Fig. 3? Otherwise, how can you map scales and zeros to their corresponding blocks (the blocks are not really blocks at this point, since their values are spread around)? A rough sketch of what I mean is below. Anyway, looking forward to the inference code.
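
A hypothetical dequantization sketch of that point, assuming codes are stored in the permuted channel order with contiguous groups; this is not SpQR's actual storage format, just an illustration of why the permutation must be stored alongside the scales and zeros:

```python
import torch

# Illustration only: if a row was quantized in permuted channel order with one
# (scale, zero) pair per contiguous group, the permutation is needed to map the
# group statistics back onto the original feature order.
def dequantize_row(codes, scales, zeros, perm, groupsize):
    # codes: integer codes in the *permuted* order; scales/zeros: one entry per group
    out = torch.empty(codes.numel(), dtype=torch.float32)
    for g in range(codes.numel() // groupsize):
        sl = slice(g * groupsize, (g + 1) * groupsize)
        out[sl] = (codes[sl].float() - zeros[g]) * scales[g]
    inv_perm = torch.argsort(perm)   # undo the permutation
    return out[inv_perm]             # row back in the original channel order
```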
Godofnothing commented 1 year ago

1) The permutation order can be crucial for some models; `act_order` usually makes a difference on the order of ~0.1 ppl.

2) Yes, we need to save the permutation, so it does incur some additional overhead (as discussed in the AWQ paper).

NRodion commented 1 year ago

Ok, thanks for the replies.