Cornell-RelaxML / quip-sharp

GNU General Public License v3.0

Memory requirement #62

Open dorsa-zeinali opened 4 months ago

dorsa-zeinali commented 4 months ago

Hi, I hope you're doing well. I am a researcher at Northeastern University trying to replicate your quantization results for Llama-2-7b. I can do so without fine-tuning and without running out of memory, but for fine-tuning during quantization and fine-tuning after quantization (if one were to do each separately), what is the memory requirement for each step? I cannot use any context size larger than 2048 without running out of memory during the post-quantization fine-tuning step. Your paper says you used NVIDIA A100 GPUs, but it does not specify how much memory they had (40GB or 80GB). I have access to 4 GPUs with 48GB of memory each. I would appreciate any insights you have. Thank you.

tsengalb99 commented 4 months ago

We used 80GB A100s. I don't remember the exact requirements, but you should be able to fit more than 2048 tokens with 48GB. You may want to avoid manifesting the entire model (turn off the --train_mode flag) and also try using activation checkpointing.
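
A minimal sketch of what activation checkpointing could look like with a Hugging Face-style Llama model; this is an illustration under that assumption, not quip-sharp's actual fine-tuning script.

```python
# Sketch: enable activation (gradient) checkpointing on a HF causal LM.
# Assumes a transformers-style model; the model name is illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)

# Recompute activations during the backward pass instead of storing them,
# trading extra compute for a large reduction in activation memory.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache conflicts with checkpointing
```
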

dorsa-zeinali commented 4 months ago

Thank you.

lzd19981105 commented 2 months ago

I used an A100 80GB to run finetune_e2e on llama2-7b-chat-4bit with ctx_size 4096, but I also ran into an OOM issue caused by the code below. I was wondering why?

```python
W_decompressed = quiptools_cuda.decompress_packed_e8p(
    Qidxs_list[0].view(m // 16, n // 64, 8, 4),
    self.codebook.grid_packed_abs) + quiptools_cuda.decompress_packed_e8p(
    Qidxs_list[1].view(m // 16, n // 64, 8, 4),
    self.codebook.grid_packed_abs) / resid_scale
x = (x.to(torch.float16) @ W_decompressed.T).to(torch.float32)
```

dorsa-zeinali commented 1 week ago

Hi, what context size and devset size do you think are reasonable for the e2e fine-tuning step, given that I have 1 GPU with 48GB?

tsengalb99 commented 1 week ago

The e2e fine-tuning script is pretty poorly written and is not very memory efficient. All it does is "train" the quantized model by updating only the unquantized parameters, such as the LM head and layernorms. This means it has to backprop through the entire model (even the quantized parts) and dequantize the weights during the forward and backward passes. I suspect torch autograd is storing W_decompressed for the backward pass, since the actual operation is x @ W_decompressed.T. You can write a custom backward pass that decompresses the weights again, which should save a lot of memory.
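
A minimal sketch of the recompute-in-backward idea described above, assuming a `decompress(...)` helper that wraps quiptools_cuda.decompress_packed_e8p; the helper and its arguments are hypothetical stand-ins, not quip-sharp's actual API.

```python
# Sketch: avoid keeping W_decompressed alive for the backward pass by
# saving only the compressed indices and decompressing again in backward.
import torch

class QuantLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, Qidxs, codebook):
        # Save only the small compressed representation, not the full weight.
        ctx.save_for_backward(Qidxs)
        ctx.codebook = codebook
        W = decompress(Qidxs, codebook)  # hypothetical helper -> fp16 (out, in)
        return (x.to(torch.float16) @ W.T).to(torch.float32)

    @staticmethod
    def backward(ctx, grad_out):
        (Qidxs,) = ctx.saved_tensors
        # Decompress again instead of letting autograd store W_decompressed.
        W = decompress(Qidxs, ctx.codebook)
        grad_x = (grad_out.to(torch.float16) @ W).to(torch.float32)
        # The quantized weight is frozen, so no gradient is returned for it.
        return grad_x, None, None
```

The trade-off is one extra decompression per layer in the backward pass in exchange for not holding every layer's dequantized weight matrix in memory between the forward and backward passes.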