silverlining21 closed this issue 7 years ago
@DamonDH You mean that as the optimization goes on, some sub-codewords become unused in the quantization indicator B(m)? First of all, it is possible that one or a few sub-codewords remain unused at the end of the optimization. However, such a phenomenon is quite rare; in my own experiments, most subspaces used all of their sub-codewords. You may need to check your initialization, as well as the updating strategy.
@jiaxiang-wu Yes, I have checked my indicator B(m); it has lost some sub-codewords.
1. I use the result of FC quantization without error correction as my initialization. I have evaluated my FC quantization without error correction on the test set, and it works well.
2. As for the updating strategy, I tried both updating one subspace at a time and updating all the subspaces at one time, using a loop outside the optimization.
Note: The previous version of the first paragraph (attached at the end of this answer) can be misleading. Sorry for that. Here is the updated one.
@DamonDH I would like to explain the computation of the residual term first. In my implementation, the complete layer response T_n is computed first (as in Eq. 5). Then the approximate layer response, which equals the sum of the layer responses contributed by each subspace, is subtracted from T_n to obtain the clean residual term. For each subspace, the actual residual term equals the clean residual term plus the layer response contributed by that subspace (as in Eq. 9). After updating B(m) and D(m) in that subspace, an updated version of the clean residual term is computed, which equals the actual residual term minus the updated layer response contributed by that subspace.
According to your description, there should be no problem with the initialization. I used multiple iterations to optimize D and B, and in each iteration, all subspaces are updated one by one, i.e.:
1-st iteration: 1-st subspace -> 2-nd subspace -> ... -> M-th subspace
...
T-th iteration: 1-st subspace -> 2-nd subspace -> ... -> M-th subspace
########### Previous version ########### @DamonDH I would like to explain the computation of residual term first. In my implementation, the complete layer response T_n is firstly computed (as in Eq. 5). Then, for each subspace, the residual term is equal to T_n minus the layer response contributed by that subspace (as in Eq. 9). After updating B(m) and D(m) in that subspace, an updated version of layer response T_n is computed, which equals to the residual term plus the layer response contributed by that subspace.
@jiaxiang-wu It seems I found the problem! Many thanks! In my implementation, I compute T_n using the forward-pass result of the original model, and I didn't update it after updating B(m) and D(m). Do you think it's proper to take that T_n to compute the residual? In my opinion, our target is to approximate the weight matrix before quantization, not the quantized weight matrix, so I take the FC layer's forward result as the original output. Of course, that's my personal view, and it seems to have been proven not to work, or maybe there are some implementation mistakes; I'm not sure about that.
Actually, there is another problem that confuses me a lot. During the update of D, we need to construct M*K (where M is the total number of subspaces and K is the number of sub-codewords in a subspace) MSE problems to optimize, right? OK, when updating D(m, k), we need to construct an MSE formulation, and the problem is that, as in Eq. 11, the same input S(n, m) has different outputs (the residuals, or call them labels). How should I organize the input and output of the MSE? As far as I'm concerned, I just concatenate the S(n, m) and R(n, m) vertically, which extends the number of inputs to N * len(L_k)? Am I right about that?
thank you in advance
@DamonDH Yes, you need to construct M*K MSE problems to optimize the codebook D. For each MSE problem, there is no need to repeat each input S(n, m) multiple times to match the size of R(n, m). Actually, you can directly take the average of R(n, m)(c_t) over all c_t for each (sample, subspace) pair. It is easy to show that this simplification is equivalent to yours, but saves much computation.
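A minimal NumPy sketch of this averaged-residual update for one of the M*K MSE problems. The function name `update_sub_codeword` and the array layout are my own for illustration, not from the paper; it assumes the indicator B(m) assigns output channels to codewords independently of the sample, so every sample averages over the same channel set:

```python
import numpy as np

def update_sub_codeword(S, R, assign, k):
    """Solve one MSE problem for sub-codeword k in one subspace m.

    S      : (N, d)  sub-vectors S(n, m) for this subspace
    R      : (N, C)  residuals R(n, m)(c_t), one column per output channel
    assign : (C,)    quantization indicator B(m): codeword index per channel
    k      : index of the sub-codeword being updated
    """
    cols = np.where(assign == k)[0]      # channels mapped to codeword k
    if cols.size == 0:                   # unused codeword: nothing to fit
        return None
    # Averaging the residual over the assigned channels is equivalent to
    # replicating each input len(cols) times, since the channel set is the
    # same for every sample; this shrinks the least-squares problem to N rows.
    r_bar = R[:, cols].mean(axis=1)      # (N,) per-sample target
    # Ordinary least squares: r_bar ~= S @ d
    d, *_ = np.linalg.lstsq(S, r_bar, rcond=None)
    return d
```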
@DamonDH The answer on how to compute the residual term has been updated. Sorry for the previous misleading answer.
@jiaxiang-wu About the residual computation.
According to your description above, here is my understanding. Let's assume we need to quantize an FC layer.
For the 1st subspace:
clean_residual = T_n - W_q1 * S(n)
actual_residual = clean_residual + layer_response
`T_n` is the original output of the FC layer to be quantized.
`W_q1` is the 1st weight matrix, obtained from k-means clustering.
`S(n)` is the input of the FC layer from the original model.
`layer_response` is computed by Eq. 9 in your paper.
For the 2nd subspace:
clean_residual = T_n - W_q2 * S(n)
actual_residual = clean_residual + layer_response
`W_q2` is the 2nd weight matrix, updated after the first subspace.
That's what I think the updating procedure should look like. Am I right about that? Or could you explain it to me in a more straightforward way? Thank you for your kindness and patience.
@DamonDH Mostly correct. The only difference between your understanding and mine is that, for the second subspace, there is no need to re-compute the clean residual term using the whole weighting matrix. Instead, you can compute it more efficiently based on the first subspace's residual term and updated layer response. Nevertheless, your implementation is also correct.
Major procedures are as follows:
1. clean_residual = T(n) - W0 * S(n) // W0 is the initial quantized weighting matrix, obtained via k-means clustering
2. compute the layer response contributed by the first subspace only, denoted as P(1)
3. actual_residual = clean_residual + P(1)
4. optimize the first subspace's codebook and quantization indicators
5. re-compute the layer response contributed by the first subspace only, denoted as Q(1)
6. clean_residual = actual_residual - Q(1)
7. repeat steps 2-6 for each following subspace to be updated
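The per-subspace procedure above can be sketched as follows. All names here (`error_correction_pass`, `S_sub`, `W_sub`) are my own; `optimize_subspace` stands in for the codebook/indicator update, and in the sketch any update that does not increase the per-subspace fitting error will keep the residual shrinking:

```python
import numpy as np

def error_correction_pass(T, S_sub, W_sub, optimize_subspace, num_iters=3):
    """T: (N, C) original layer response; S_sub[m]: (N, d_m) input block for
    subspace m; W_sub[m]: (d_m, C) quantized weight block for subspace m."""
    M = len(W_sub)
    # step 1: clean residual w.r.t. the initial (k-means) quantized weights
    clean_residual = T - sum(S_sub[m] @ W_sub[m] for m in range(M))
    for _ in range(num_iters):
        for m in range(M):
            P = S_sub[m] @ W_sub[m]               # step 2: response of subspace m
            actual_residual = clean_residual + P  # step 3
            W_sub[m] = optimize_subspace(S_sub[m], actual_residual, W_sub[m])  # step 4
            Q = S_sub[m] @ W_sub[m]               # step 5: updated response
            clean_residual = actual_residual - Q  # step 6
    return W_sub, clean_residual
```

With an ideal (unconstrained least-squares) `optimize_subspace`, this is block coordinate descent, so the residual norm is non-increasing across sweeps; the quantized update in the paper has the same property.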
OK, thank you so much, everything seems to be clear. I'm going to re-implement it. Let's see.
@jiaxiang-wu Thanks to your help, I have successfully implemented the FC quantization with error correction. But when testing on my dataset (train 120k, validation 50k), the accuracy drops from 0.96 to 0.92. Is this reasonable? After analyzing, I found one possible reason: my quantized model is trained only once, i.e., each subspace is updated only one time. Such insufficient training may lead to this result. Next, I plan to add a loop to update the subspaces many times. Do you have any suggestions or experience about the number of loops? Thanks a lot. And are there any other possible reasons?
Thanks again!
Yep, got it! Thanks a lot!
@jiaxiang-wu Hello jiaxiang. After adding an external loop to update each subspace, I got a much worse result. Here are my experiment setups:
Here is the result:
And my confusions:
Hoping you can give me some suggestions or instructions. Many thanks.
@DamonDH The result with EC should be better than that without EC. You may need to re-check your implementation. Here is one hint. Assuming you are learning the quantization parameters (D and B) with 50k training samples, the layer response approximation error (as defined in Eq. 8) on this training subset should be monotonically decreasing, because neither updating D nor updating B will ever increase this approximation error. You may verify this on your current implementation.
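One way to run this sanity check, assuming per-subspace input blocks `S_sub[m]` and quantized weight blocks `W_sub[m]` (these names are mine, not from the paper):

```python
import numpy as np

def approx_error(T, S_sub, W_sub):
    """Layer response approximation error (as in Eq. 8), summed over samples.

    T: (N, C) original layer response; S_sub[m]: (N, d_m) input block;
    W_sub[m]: (d_m, C) quantized weight block for subspace m.
    """
    approx = sum(S @ W for S, W in zip(S_sub, W_sub))
    return float(np.linalg.norm(T - approx) ** 2)

# During training, log this value after every D-update and every B-update;
# the logged sequence must never increase. If it does, the corresponding
# update step has a bug.
```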
I tried to implement the error correction for the FC layer, but I got a problem: as the updating of D(m) and B(m) goes on, some subspaces' mapping tables B(m) tend to be built from only a few sub-codewords, which leads the updating to abort.
I tried to figure out the reason; it seems that I misunderstood the definition of the residual? The following snippet is my implementation of computing the residual of one subspace on all the inputs.
```python
def compute_all_residual_on_one_dim(self, sub_dim_index):
    # Sum up the layer response contributed by every subspace except
    # the one being updated (sub_dim_index).
    construct_output = np.zeros(self.N_resudial.shape)
    for i in xrange(self.num_sub_dims):
        if sub_dim_index == i:
            continue
        table_B = self.from_asmt_data_get_index_table(i)
        print table_B.shape, self.centeroid_data[i].shape
```
@jiaxiang-wu