Cornell-RelaxML / quip-sharp

GNU General Public License v3.0
480 stars 42 forks

How to quant 1.3B model to 2bit #12

Closed ustcwhy closed 9 months ago

ustcwhy commented 9 months ago

Thanks for your wonderful work! I am trying to use quip-sharp to quantize my 1.3B model, based on the llama architecture, to 2 bits. The config of my model is:

config = LlamaConfig(
    vocab_size=len(dictionary),
    hidden_size=2048,
    intermediate_size=5460,
    num_hidden_layers=24,
    num_attention_heads=32,
    num_key_value_heads=None,
    hidden_act="silu",
    max_position_embeddings=2048,
    initializer_range=0.02,
    rms_norm_eps=1e-6,
    use_cache=True,
    pad_token_id=dictionary.pad(),
    bos_token_id=dictionary.bos(),
    eos_token_id=dictionary.eos(),
    pretraining_tp=1,
    tie_word_embeddings=True,
    rope_theta=10000.0,
    rope_scaling=None,
    attention_bias=False,
)

However, I encounter the following error:

Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/bit/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/xxx/anaconda3/envs/bit/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 261, in quantize_layer_queue
    quantize_layer(next_item, cb, args, device, False)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 246, in quantize_layer
    quantize_up(layer, idx, cb, args, device, check_only=not return_layer)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 162, in quantize_up
    hatW, attr = quip.quantize(H, W_upgate, args.lora_rank, cb, args, device)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 331, in quantize
    incoh_out = incoherence_preprocess(H, W, args)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 48, in incoherence_preprocess
    Wr = RHT_W(Wr, SU, SV)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 13, in RHT_W
    return utils.matmul_hadUt(utils.matmul_hadUt(W.T * SV).T * SU)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", line 96, in matmul_hadUt
    return matmul_hadU(X, transpose=True)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", line 78, in matmul_hadU
    hadK, K = get_hadK(n, transpose)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", line 16, in get_hadK
    assert (is_pow2(n // 156))
AssertionError

The n is 10920. Could you provide some suggestions to resolve the problem?

Thx~

tsengalb99 commented 9 months ago

This "error" is because we didn't store a hadamard matrix that works with 10920. 10920 is a rather annoying number since it decomposes into 2^3 3 5 7 13, rather than 2^q * k where k is an even "small" number. Furthermore, there are also more odd primes in that factorization than powers of two, which makes doing a nested transform hard. We will definitely have to make some code changes to handle your model size.

However, one thing you can do if you want to get around this without waiting for us is to "outlier channel split" the up and gate matrices to be 6144 (512 * 12) or 5504 (32 * 172). What this means is you split the output channels of up and gate and input channels of down so the output is the same as before. https://arxiv.org/abs/1901.09504 has more details but this can be done entirely offline. You will get a slightly larger model (I think 6.5M more params with 5504) but since you're quantizing to 2 bits that should not make a huge difference.
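In case a concrete starting point helps, here is a rough offline sketch of that kind of splitting for a standard transformers-style LlamaMLP (gate_proj / up_proj / down_proj). The function name, the largest-norm channel-selection heuristic, and the 5504 target are illustrative assumptions, not a prescribed procedure:

```python
import torch

@torch.no_grad()
def split_mlp_channels(mlp, target_dim=5504):
    """Widen a LLaMA MLP to target_dim intermediate channels without changing
    what it computes, by duplicating a few channels (illustrative sketch)."""
    old_dim = mlp.gate_proj.out_features          # 5460 for this model
    n_split = target_dim - old_dim                # 44 channels to add
    # Illustrative heuristic: duplicate the largest-norm up_proj rows
    # ("outlier" channels); any selection keeps the output mathematically equal.
    idx = mlp.up_proj.weight.norm(dim=1).topk(n_split).indices

    # gate/up: append verbatim copies of the selected OUTPUT channels, so the
    # intermediate activation silu(gate(x)) * up(x) is simply repeated there.
    for proj in (mlp.gate_proj, mlp.up_proj):
        w = proj.weight.data                      # shape [old_dim, hidden]
        proj.weight = torch.nn.Parameter(torch.cat([w, w[idx]], dim=0))
        proj.out_features = target_dim

    # down: halve the selected INPUT channels, then append copies of the halved
    # columns, so the two copies sum back to the original contribution.
    wd = mlp.down_proj.weight.data.clone()        # shape [hidden, old_dim]
    wd[:, idx] *= 0.5
    mlp.down_proj.weight = torch.nn.Parameter(torch.cat([wd, wd[:, idx]], dim=1))
    mlp.down_proj.in_features = target_dim
```

Because the duplicated gate/up channels carry exactly the same activations, and the two halved down_proj columns sum to the original one, the MLP output is mathematically unchanged before quantization; only the parameter count grows slightly.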

ustcwhy commented 9 months ago

This "error" is because we didn't store a hadamard matrix that works with 10920. 10920 is a rather annoying number since it decomposes into 2^3 3 5 7 13, rather than 2^q k where k is an even "small" number. Furthermore, there are also more odd primes in that factorization than powers of two, which makes doing a nested transform hard. We will definitely have to make some code changes to handle your model size. However, one thing you can do if you want to get around this without waiting for us is to "outlier channel split" the up and gate matrices to be 6144 (51212) or 5504 (32*172). What this means is you split the output channels of up and gate and input channels of down so the output is the same as before. https://arxiv.org/abs/1901.09504 has more details but this can be done entirely offline. You will get a slightly larger model (I think 6.5M more params with 5504) but since you're quantizing to 2 bits that should not make a huge difference. Get Outlook for Androidhttps://aka.ms/AAb9ysg ____ From: Hongyu Wang @.> Sent: Saturday, December 9, 2023 11:32:02 AM To: Cornell-RelaxML/quip-sharp @.> Cc: Subscribed @.> Subject: [Cornell-RelaxML/quip-sharp] How to quant 1.3B model to 2bit (Issue #12) Thanks for your wonderful work! I try to use quip-sharp to quantize my 1.3B model base on llama arch to 2bit. The config of my model is: config = LlamaConfig( vocab_size=len(dictionary), hidden_size=2048, intermediate_size=5460, num_hidden_layers=24, num_attention_heads=32, num_key_value_heads=None, hidden_act="silu", max_position_embeddings=2048, initializer_range=0.02, rms_norm_eps=1e-6, use_cache=True, pad_token_id=dictionary.pad(), bos_token_id=dictionary.bos(), eos_token_id=dictionary.eos(), pretraining_tp=1, tie_word_embeddings=True, rope_theta=10000.0, rope_scaling=None, attention_bias=False, ) However, I encounter the following error: Traceback (most recent call last): File "/home/xxx/anaconda3/envs/bit/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/home/xxx/anaconda3/envs/bit/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(self._args, self._kwargs) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 261, in quantize_layer_queue quantize_layer(next_item, cb, args, device, False) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 246, in quantize_layer quantize_up(layer, idx, cb, args, device, check_only=not return_layer) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 162, in quantize_up hatW, attr = quip.quantize(H, W_upgate, args.lora_rank, cb, args, device) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 331, in quantize incoh_out = incoherence_preprocess(H, W, args) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 48, in incoherence_preprocess Wr = RHT_W(Wr, SU, SV) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 13, in RHT_W return utils.matmul_hadUt(utils.matmul_hadUt(W.T SV).T * SU) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", line 96, in matmul_hadUt return matmul_hadU(X, transpose=True) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", line 78, in matmul_hadU hadK, K = get_hadK(n, transpose) File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", 
line 16, in get_hadK assert (is_pow2(n // 156)) AssertionError The n is 10920. Could you provide some suggestions to resolve the problem? Thx~ — Reply to this email directly, view it on GitHub<#12>, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH6WZSDNJFFX3GJIVW26F2TYISHAFAVCNFSM6AAAAABAN4XTN6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGAZTGOJSGMZTINY. You are receiving this because you are subscribed to this thread.Message ID: @.***>

Thanks for your fast reply! I will try to train a model with 5504 intermediate ffn dimension~

tsengalb99 commented 9 months ago

Just to be clear, you don't need to retrain a 5504 model. You can take your existing 5460 model and split 44 channels (and the corresponding parts of the Hessians) and that should work. Of course, retraining will probably give better results.
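For completeness, a matching sketch for the saved proxy Hessian of down_proj, the only one whose dimension changes. This assumes the Hessian is (up to regularization) the second-moment matrix of the layer's 5460 input activations, so duplicating activation channels just duplicates the corresponding rows and columns; the gate/up Hessians, which live in the hidden_size input space, need no change under the split sketched above:

```python
import torch

def split_hessian(H: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Extend a [5460, 5460] down_proj proxy Hessian to [5504, 5504] after the
    activation channels in idx have been duplicated (illustrative sketch)."""
    H = torch.cat([H, H[idx, :]], dim=0)   # duplicate the selected rows ...
    H = torch.cat([H, H[:, idx]], dim=1)   # ... then the matching columns
    return H
```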
