For this model? What size? That of course matters a lot.
You would not pre-train. But it'd be the same answer for fine-tuning. And there really isn't a single answer, because you can trade off memory for speed in many ways. GPUs like the A100 are ideal, as they require little tradeoff. A10 and V100 are viable (see README) for the smaller 3B/7B, but significantly slower. I don't think anything with 16GB is viable.
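To make the "trade memory for speed" point concrete, here is a minimal sketch of the usual knobs, assuming the Hugging Face `transformers` Trainer API rather than this repo's exact training script; the paths and values are placeholders, not recommendations:

```python
# Sketch: common knobs for trading GPU memory against speed when fine-tuning,
# assuming the Hugging Face transformers Trainer API. Paths/values are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./finetune-out",       # placeholder path
    per_device_train_batch_size=1,     # small per-GPU batch to fit in memory
    gradient_accumulation_steps=8,     # recover a larger effective batch at the cost of time
    gradient_checkpointing=True,       # recompute activations: less memory, more compute
    fp16=True,                         # 16-bit training roughly halves weight/activation memory vs fp32
)
```

Gradient checkpointing and gradient accumulation are the typical levers: both cut peak memory, but each step (or effective batch) takes longer.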
For generation, the rule of thumb is of course that the model needs about 2 bytes per parameter, because you will load it in 16-bit. You can load in 8-bit for half the memory, at some cost to accuracy. And you need enough room for your input, which depends on the input. So again, no one answer. A10 is possible for 12B but not ideal; A100 is ideal. For 7B/3B, A10 is fine, and T4/V100 are possible but less ideal.
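In other words, weight memory is just parameter count times bytes per parameter; a quick illustrative sketch (real usage also needs room for inputs, activations, and the KV cache on top of this):

```python
# Back-of-envelope memory for the weights alone, for generation.
# 2 bytes/param for 16-bit, 1 byte/param for 8-bit. Inputs, activations,
# and the KV cache need additional room beyond this.

def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1024**3

for name, n in [("3B", 3e9), ("7B", 7e9), ("12B", 12e9)]:
    print(f"{name}: ~{weights_gib(n):.1f} GiB in 16-bit, ~{weights_gib(n, 1):.1f} GiB in 8-bit")
```

Those numbers line up with the guidance above: 12B in 16-bit just about fits a 24GB A10, while 7B/3B fit comfortably on an A10 and are possible on 16GB cards.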
Thanks @srowen for your detailed answer. I have some follow-up questions:
What's the general rule of thumb when it comes to estimating how much GPU memory I would need?
And how would the rough estimate of 2x the number of parameters fit in here? Does it refer to simply loading the model?
Pre-training is not recommended. This is the kind of thing that can cost a million bucks. Fine-tuning could be fine. I'm saying the task is the same, so hardware considerations are the same.
Fine-tuning is just fairly different from generation. It will in general need more memory. It also offers more possibilities to trade off memory for speed, if you can afford latency. See DeepSpeed.
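As one example of that tradeoff, a DeepSpeed ZeRO config can shard optimizer state and gradients across GPUs and optionally offload optimizer state to CPU, cutting GPU memory at the cost of slower steps. A minimal illustrative config (values are placeholders, not tuned settings for this model):

```python
# Illustrative DeepSpeed config: ZeRO stage 2 shards optimizer state and
# gradients across GPUs; offloading optimizer state to CPU trades step speed
# for GPU memory. Values are placeholders, not tuned settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # slower steps, much less GPU memory
    },
}
```

ZeRO stage 3 additionally shards the parameters themselves, which helps most for the larger models.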
The answer depends on the model, your input, and how you train; the answer is quite different even for 3B vs 12B here. I would tell you, as above and in the README, that A100s (40GB) are ideal for training. Anything else will require tradeoffs to work, and will cost more and take a lot longer. This is for fine-tuning. For 12B, you could probably expect to spend thousands of dollars, if that helps. Not $100 and not $10,000, probably.
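For a sense of why A100s and multiple GPUs come into play: under standard mixed-precision Adam (an assumption about the training setup, not a statement about this repo's scripts), the training state alone is roughly 16 bytes per parameter, before activations:

```python
# Rough fine-tuning state size under mixed-precision Adam (an assumed setup):
# 2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master weights) + 8 (fp32 Adam
# moments) ~= 16 bytes per parameter, before any activation memory.
BYTES_PER_PARAM_STATE = 16

for name, n in [("3B", 3e9), ("7B", 7e9), ("12B", 12e9)]:
    gb = n * BYTES_PER_PARAM_STATE / 1e9
    print(f"{name}: ~{gb:.0f} GB of state -> roughly {gb / 40:.1f}x 40GB A100s, before activations")
```

That is why 12B fine-tuning is effectively a multi-GPU job unless you shard or offload state as above, and why the cost lands in the thousands rather than the hundreds.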
Thank you so much for your reply. I will close this thread now.