CAS-CLab / quantized-cnn

An efficient framework for convolutional neural networks

Implementation issues about error correction for the fully connected layer? #9

Closed silverlining21 closed 7 years ago

silverlining21 commented 7 years ago

I am trying to implement error correction for the FC layer, but I ran into a problem: as the updates of D(m) and B(m) proceed, the mapping table B(m) of some subspaces ends up being constructed from only a few sub-codewords, which causes the update to abort.

I tried to figure out the reason; it seems that I may have misunderstood the definition of the residual. The following snippet is my implementation of computing the residual of one subspace over all the inputs.

```python
def conpute_all_residual_on_one_dim(self, sub_dim_index):
    construct_ouptput = np.zeros(self.N_resudial.shape)
    for i in xrange(self.num_sub_dims):
        if sub_dim_index == i:
            continue
        table_B = self.from_asmt_data_get_index_table(i)
        print table_B.shape, self.centeroid_data[i].shape
        res_dot = np.dot(table_B, self.centeroid_data[i].T)  # dot([4*64], [64*6916])
        # sum up residual construct by different subspace    # N*6915 = dot([N*4], dot([4*64], [64*6916]))
        construct_ouptput += np.dot(self.feat_in[:, i*self.len_sub_dim:(i+1)*self.len_sub_dim], res_dot.T)
    self.N_resudial = self.feat_out - construct_ouptput
```

@jiaxiang-wu

jiaxiang-wu commented 7 years ago

@DamonDH You mean that as the optimization goes on, some sub-codewords become unused in the quantization indicator B(m)? First of all, it is possible that one or a few sub-codewords remain unused at the end of the optimization. However, such a phenomenon is quite rare; in my own experiments, most subspaces used all of their sub-codewords. You may need to check your initialization, as well as the updating strategy.

silverlining21 commented 7 years ago

@jiaxiang-wu Yes, I have checked my indicator B(m); it has lost some sub-codewords.

1. I use the result of FC quantization without error correction as my initialization, and I have evaluated that quantization (without error correction) on the test set; it works well.

2. As for the updating strategy, I tried both updating one subspace at a time and updating all subspaces at once, using a loop outside the optimization.

jiaxiang-wu commented 7 years ago

Note: The previous version of the first paragraph (attached at the end of this answer) can be misleading. Sorry for that. Here is the updated one.

@DamonDH I would like to explain the computation of the residual term first. In my implementation, the complete layer response T_n is first computed (as in Eq. 5). Then the approximate layer response, which equals the sum of the layer responses contributed by all subspaces, is subtracted from T_n to obtain the clean residual term. For each subspace, the actual residual term equals the clean residual term plus the layer response contributed by that subspace (as in Eq. 9). After updating B(m) and D(m) in that subspace, an updated version of the clean residual term is computed, which equals the actual residual term minus the updated layer response contributed by that subspace.
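For intuition, here is a minimal, self-contained NumPy check of this bookkeeping; all names and shapes below are made up for illustration and are not taken from this repository:

```python
import numpy as np

# Toy check of the residual bookkeeping (illustrative names only).
rng = np.random.default_rng(0)
N, M = 5, 3                                              # samples, subspaces
T = rng.standard_normal((N, 1))                          # complete layer response (Eq. 5)
resp = [rng.standard_normal((N, 1)) for _ in range(M)]   # per-subspace layer responses

clean = T - sum(resp)                     # clean residual: subtract ALL subspaces
m = 1
actual_m = clean + resp[m]                # actual residual for subspace m (Eq. 9)
new_resp_m = rng.standard_normal((N, 1))  # response after updating D(m) and B(m)
new_clean = actual_m - new_resp_m         # restore the clean residual

# Same as recomputing the clean residual from scratch with the updated response:
assert np.allclose(new_clean, T - (sum(resp) - resp[m] + new_resp_m))
```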

According to your description, there should be no problem with the initialization. I used multiple iterations to optimize D and B, and in each iteration, all subspaces are updated one by one, i.e.:

1st iteration: 1st subspace -> 2nd subspace -> ... -> M-th subspace
...
T-th iteration: 1st subspace -> 2nd subspace -> ... -> M-th subspace

########### Previous version ########### @DamonDH I would like to explain the computation of residual term first. In my implementation, the complete layer response T_n is firstly computed (as in Eq. 5). Then, for each subspace, the residual term is equal to T_n minus the layer response contributed by that subspace (as in Eq. 9). After updating B(m) and D(m) in that subspace, an updated version of layer response T_n is computed, which equals to the residual term plus the layer response contributed by that subspace.

silverlining21 commented 7 years ago

@jiaxiang-wu It seems I found the problem! Many thanks! In my implementation, I compute T_n using the forward-pass result of the original model, and I didn't update it after updating B(m) and D(m). Do you think it is proper to use that T_n to compute the residual? In my opinion, our target is to approximate the weight matrix before quantization, not the quantized weight matrix, so I took the FC layer's forward result as the original output. That is my personal view, and it seems to have been proven not to work, or maybe there are some implementation mistakes; I am not sure about that.

Actually, there is another problem that confuses me a lot. During the update of D, we need to construct M*K MSE problems to optimize (where M is the total number of subspaces and K is the number of sub-codewords in a subspace), right? When updating D(m,k), we need to construct an MSE formulation, and the problem is that, as in Eq. 11, the same input S(n,m) has different outputs (the residuals, or call them labels). How should I organize the inputs and outputs of the MSE? As far as I can tell, I would just concatenate the S(n,m) and R(n,m) vertically, which extends the number of inputs to N*len(L_k). Am I right about that?

thank you in advance

jiaxiang-wu commented 7 years ago

@DamonDH Yes, you need to construct M*K MSE problems to optimize the codebook D. For each MSE problem, there is no need to repeat each input S(n, m) multiple times to match the size of R(n, m). Actually, you can directly take the average of R(n, m)(c_t) over all c_t for each (sample, subspace) pair. It is easy to show that such a simplification is equivalent to yours, but it saves much computation.
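To see why the two are equivalent, here is a small NumPy check (names and shapes are invented for illustration, not taken from this repository): fitting a least-squares solution against the per-sample mean target gives the same answer as repeating each input once per target.

```python
import numpy as np

# Toy check: averaging the targets == repeating the inputs (illustrative only).
rng = np.random.default_rng(0)
N, d, C = 20, 4, 3                       # samples, sub-vector length, targets per sample
S = rng.standard_normal((N, d))          # sub-input vectors S(n, m)
R = rng.standard_normal((N, C))          # C residual targets per sample, R(n, m)(c_t)

# Variant 1: repeat each input C times, one target per copy.
S_rep = np.repeat(S, C, axis=0)          # (N*C, d)
R_rep = R.reshape(-1, 1)                 # (N*C, 1), same ordering as S_rep
w_rep, *_ = np.linalg.lstsq(S_rep, R_rep, rcond=None)

# Variant 2: keep each input once, regress on the mean target.
R_mean = R.mean(axis=1, keepdims=True)   # (N, 1)
w_mean, *_ = np.linalg.lstsq(S, R_mean, rcond=None)

assert np.allclose(w_rep, w_mean)        # identical least-squares solutions
```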

jiaxiang-wu commented 7 years ago

@DamonDH The answer on how to compute the residual term has been updated. Sorry for the previous misleading answer.

silverlining21 commented 7 years ago

@jiaxiang-wu About the residual computation: according to your description above, here is my understanding. Let's assume we need to quantize an FC layer.

For the 1st subspace:
clean_residual = T_n - W_q1 * S(n)
actual_residual = clean_residual + layer_response

T_n is the original output of the FC layer to be quantized, W_q1 is the 1st weight matrix, obtained from k-means clustering, S(n) is the input of the FC layer from the original model, and layer_response is computed by Eq. 9 in your paper.

For the 2nd subspace:
clean_residual = T_n - W_q2 * S(n)
actual_residual = clean_residual + layer_response

W_q2 is the 2nd weight matrix, which has been updated in the first subspace.

That is what I think the updating procedure should look like. Am I right about that? Or could you explain it to me in a more straightforward way? Thank you for your kindness and patience.

jiaxiang-wu commented 7 years ago

@DamonDH Mostly correct. The only difference between your understanding and mine is that, for the second subspace, there is no need to re-compute the clean residual term using the whole weighting matrix. Instead, you can compute it more efficiently based on the first subspace's residual term and updated layer response. Nevertheless, your implementation is also correct.

Major procedures are as follows:

  1. clean_residual = T(n) - W0 * S(n) // W0 is the initial quantized weighting matrix, obtained via k-means clustering
  2. compute the layer response contributed by the first subspace only, denoted as P(1)
  3. actual_residual = clean_residual + P(1)
  4. optimize the first subspace's codebook and quantization indicators
  5. re-compute the layer response contributed by the first subspace only, denoted as Q(1)
  6. clean_residual = actual_residual - Q(1)
  7. repeat 2-6 for any following subspace to be updated
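A minimal sketch of this loop in NumPy-style Python (the function arguments `layer_response` and `update_subspace`, and all names and shapes, are placeholders for your own routines, not code from this repository):

```python
def optimize_with_error_correction(T, S, codebooks, indicators,
                                   layer_response, update_subspace, num_iters):
    """Alternating optimization with error correction.

    T: (N, C_out) original layer responses; S: list of M sub-inputs, each (N, d_m);
    layer_response(S_m, D_m, B_m) -> (N, C_out); update_subspace(S_m, residual) -> (D_m, B_m).
    """
    M = len(codebooks)
    # Step 1: clean residual w.r.t. the initial (k-means) quantized weights.
    clean_residual = T - sum(layer_response(S[m], codebooks[m], indicators[m])
                             for m in range(M))
    for _ in range(num_iters):                  # T iterations
        for m in range(M):                      # 1st subspace -> ... -> M-th subspace
            # Steps 2-3: actual residual = clean residual + this subspace's response.
            P_m = layer_response(S[m], codebooks[m], indicators[m])
            actual_residual = clean_residual + P_m
            # Step 4: update D(m) and B(m) against the actual residual.
            codebooks[m], indicators[m] = update_subspace(S[m], actual_residual)
            # Steps 5-6: recompute the response and restore the clean residual.
            Q_m = layer_response(S[m], codebooks[m], indicators[m])
            clean_residual = actual_residual - Q_m
    return codebooks, indicators
```

The point of carrying `clean_residual` across subspaces and iterations is that it never has to be recomputed from the full weight matrix after the initial pass.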
silverlining21 commented 7 years ago

OK, thank you so much. Everything seems to be clear now; I am going to re-implement it. Let's see how it goes.

silverlining21 commented 7 years ago

@jiaxiang-wu Thanks to your help, I have successfully implemented the FC quantization with error correction. But when testing on my dataset (train 120k, validation 50k), the accuracy decreases from 0.96 to 0.92. Is this reasonable? After some analysis, I found one possible reason: my quantized model is trained only once, i.e., each subspace is updated only one time, and such insufficient training may lead to this result. Next, I plan to add a loop to update the subspaces multiple times. Do you have any suggestions or experience about the number of loops? Thanks a lot. And are there any other possible reasons?

thanks again !

jiaxiang-wu commented 7 years ago
silverlining21 commented 7 years ago

yep, got it! thanks a lot!

silverlining21 commented 7 years ago

@jiaxiang-wu Hello jiaxiang. After adding an external loop to update each subspace, I got a much worse result. Here is my experiment setup:

  1. Training set: 1.8 million images. I take a shuffled batch of 51.2k images each time to update all subspaces sequentially; finishing one pass over the training set takes about 36 such updates.
  2. Validation set: about 50k images.

Here is the result: (plot attached)

And here are my confusions:

  1. Is it reasonable that the best result with EC (95.33%) is never better than the baseline (96.84%), which comes from k-means clustering without EC?
  2. As shown above, as training goes on, the result keeps getting worse and shows no sign of coming back to normal. So I wonder whether something is wrong in my implementation? I reviewed the code afterwards but found nothing.
  3. Are there any hyperparameters that I should pay extra attention to?

Hoping you can give me some suggestions or instructions. Many thanks.

jiaxiang-wu commented 7 years ago

@DamonDH The result with EC should be better than that without EC. You may need to re-check your implementation. Here is one hint. Assuming you are learning the quantization parameters (D and B) with 50k training samples, the layer response approximation error (as defined in Eq. 8) on this training subset should be monotonically decreasing, since neither the D-update nor the B-update can ever increase this approximation error. You may verify this on your current implementation.
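One way to run this sanity check is to log the approximation error after every D-update and B-update and assert that it never goes up. A minimal sketch follows; `compute_approx_error`, `update_codebook`, and `update_indicators` are placeholders for your own routines, not functions from this repository.

```python
import numpy as np

def check_monotonic_decrease(state, compute_approx_error, update_codebook,
                             update_indicators, num_iters, num_subspaces):
    """Record the Eq. 8 approximation error after each update and check it never increases."""
    errors = [compute_approx_error(state)]
    for _ in range(num_iters):
        for m in range(num_subspaces):
            update_codebook(state, m)              # update D(m)
            errors.append(compute_approx_error(state))
            update_indicators(state, m)            # update B(m)
            errors.append(compute_approx_error(state))
    errors = np.asarray(errors)
    assert np.all(np.diff(errors) <= 1e-9), "approximation error increased at some update"
    return errors
```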