Hi! It is recommended to use the seq_len the model you're quantizing was trained on (4096 for Llama-2, 8192 for Mistral/Mixtral). To reduce the number of samples and speed up computation, you should decrease `--nsamples` instead.
However, it doesn't have that large an impact on the quantization time anyway.
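To make the trade-off concrete, here is a minimal sketch (not the AQLM code itself; the function name and arguments are illustrative assumptions) of how `--nsamples` and the sequence length together determine how much calibration data gets processed:

```python
# Illustrative sketch: packing calibration text into up to `nsamples` blocks
# of `seqlen` tokens each. Total work scales roughly with nsamples * seqlen,
# which is why lowering --nsamples is the recommended speedup knob.
import torch
from transformers import AutoTokenizer

def build_calibration_set(texts, tokenizer, nsamples=1024, seqlen=8192):
    """Pack raw texts into up to `nsamples` token blocks of length `seqlen`."""
    ids = tokenizer("\n\n".join(texts), return_tensors="pt").input_ids[0]
    blocks = [
        ids[i * seqlen : (i + 1) * seqlen]
        for i in range(nsamples)
        if (i + 1) * seqlen <= ids.numel()  # skip incomplete trailing blocks
    ]
    return torch.stack(blocks)  # shape: (n_blocks, seqlen)
```

Halving `nsamples` halves the calibration workload while keeping each sample at the training-time sequence length.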
@BlackSamorez Thanks for the answer!
I am trying to quantize a finetuned version of Mixtral, and my training set had no samples that long (8192 tokens).
Should I then decrease max_epochs and finetune_max_epochs instead (in order to speed up the process)?
@VirtualRoyalty you may try it and see how shorter sequences affect the quality. When I was tuning Mixtral, I used 7k instead of 8k to fit into memory, and that seemed to work fine. However, 1k is much shorter than 8k, so I cannot say a priori whether it matters much.
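If you want to check this empirically, here is a minimal sketch (the model path and held-out text file are placeholders, not part of this repo) that compares perplexity at a short and a long context:

```python
# Sketch: measure perplexity of a quantized causal LM at two context lengths,
# so a model calibrated with seqlen=1024 can be compared against one
# calibrated with seqlen=8192 on long-context quality.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def perplexity(model, tokenizer, text, ctx_len):
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :ctx_len].to(model.device)
    out = model(ids, labels=ids)  # HF shifts labels internally for causal LM loss
    return torch.exp(out.loss).item()

model_id = "path/to/quantized-mixtral"  # placeholder path
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

text = open("heldout.txt").read()  # placeholder held-out evaluation text
for n in (1024, 8192):
    print(f"ctx={n}: ppl={perplexity(model, tok, text, n):.3f}")
```

Run it for both quantized variants; if the 1k-calibrated one degrades noticeably at the 8k context, the shorter calibration length matters for your use case.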
@Godofnothing Thanks, good point!
Hi! Thanks for such a useful tool! I have a question about `model_seqlen`: as I can see, the default value in main.py is 4096. What if I use a smaller value, e.g. 1024, when quantizing a MoE Mixtral model? Will it affect the quality of the quantized model, or the quality at contexts longer than 1024 tokens? Will it significantly speed up the quantization process?
Thanks in advance!