abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License
1.05k stars 77 forks

GPU VRAM Usage during training #58

Open KevinD777 opened 8 months ago

KevinD777 commented 8 months ago

Hi,

Thanks for your great work! I have some questions regarding the GPU usage when training with LLaMa 2:

  1. What is the peak usage of the VRAM when training the Unlimiformer using the long-range training methods in both 8k and 16k settings?
  2. Since the complexity is linear during training, training at 16k should use roughly double the VRAM of 8k, if I understand correctly. So if I wanted to train Unlimiformer at 80k, would it use 10 times the VRAM of 8k?
  3. I saw in a previous issue that Unlimiformer can currently only be trained on a single GPU, so the training length is limited by the memory of a single GPU, say 80 GB for an A100. So I am curious: is 16k the maximum possible training length for now?

Thanks!

abertsch72 commented 7 months ago

Thanks for your interest!

  1. Looking back at some old run data, I'm seeing ~45 GB of GPU memory for BART-base with 16k max length (using retrieval training). I don't have numbers handy for the 8k case right now, but I'd guess a little less than halfway between that and the cost of finetuning BART without Unlimiformer.
  2. Roughly, yes-- there's some fixed cost for storing the model weights themselves, but most of the memory required comes from the input + computational graph. So it would be slightly less than 10x more expensive, but that's the right ballpark.
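The "fixed cost plus linear term" reasoning above can be sketched with a back-of-the-envelope model. The ~45 GB at 16k figure comes from the reply; the 5 GB fixed cost for weights and optimizer states is a hypothetical placeholder, not a measured value:

```python
# Rough linear model of training VRAM:
#   VRAM ≈ fixed cost (weights + optimizer) + per-token activation cost.
# 45 GB at 16k tokens is the measured point from this thread;
# FIXED_GB = 5.0 is an assumed, illustrative value.

FIXED_GB = 5.0
MEASURED_GB_AT_16K = 45.0

# Solve for the per-token cost from the one measured point.
per_token_gb = (MEASURED_GB_AT_16K - FIXED_GB) / 16_000

def estimate_vram_gb(input_length: int) -> float:
    """Linearly extrapolate training VRAM for a given input length."""
    return FIXED_GB + per_token_gb * input_length

for length in (8_000, 16_000, 80_000):
    print(f"{length:>6} tokens -> ~{estimate_vram_gb(length):.0f} GB")
```

Under these assumed numbers, 80k comes out around 8x the cost of 8k rather than a full 10x, because the fixed cost doesn't scale with length.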
  3. This depends on the model size and your GPU size-- in the paper we were using BART-base and a 48-GB GPU, so we were limited to ~16k.