harshsikka opened this issue 1 year ago
I am using the cost of training LLM models as a reference point for the GATO estimate, since GATO uses a similar transformer architecture to LLMs.
Refer to https://blog.eleuther.ai/transformer-math/ for cost estimation of training LLMs. In particular I am using the equation tau * T = 6 * P * D, where tau is the hardware throughput, T is the training time in seconds, P is the number of parameters, and D is the number of training tokens.
For an A100, tau = 312 teraFLOPS = 312 * 10^12 FLOP/s with FP16 or BFLOAT16; refer to the A100 datasheet for this figure: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
Suppose our MVP needs to train a model of 1B parameters, i.e. P = 10^9, and suppose we are training on 1B tokens (although the URL above recommends no less than 200B tokens for a serious LLM). Let's use 1B tokens for the calculation and then scale as needed, so D = 10^9.
Plugging into the equation: 312 * 10^12 * T = 6 * 10^9 * 10^9, so T ≈ 19230 seconds. That translates to one (1) A100 training for about 5.34 hours, so if we are training on 200B tokens, it will be about 1100 hours on one (1) A100. And we can scale accordingly.
Suppose $1.3 per hour for A100 GPU time; the cost will then be about $1400-$1500. That sounds very low. Is it feasible? Are we missing anything?
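For reference, here is a minimal sketch of that calculation, using the 6PD approximation from the EleutherAI post; the peak FLOPS and $/hour price are the assumptions stated above, not measured values:

```python
# Back-of-the-envelope training cost from the approximation tau * T = 6 * P * D
# (https://blog.eleuther.ai/transformer-math/). All inputs below are the
# assumptions from this thread, not measured numbers.

A100_PEAK_FLOPS = 312e12   # FP16/BF16 peak throughput of one A100, FLOP/s
A100_PRICE_PER_HOUR = 1.3  # assumed rental price in USD

def a100_hours_and_cost(params: float, tokens: float) -> tuple[float, float]:
    """Return (A100-hours, USD) to train `params` parameters on `tokens` tokens."""
    total_flop = 6 * params * tokens          # 6PD approximation of training FLOPs
    seconds = total_flop / A100_PEAK_FLOPS    # time on a single A100 at peak throughput
    hours = seconds / 3600
    return hours, hours * A100_PRICE_PER_HOUR

print(a100_hours_and_cost(1e9, 1e9))    # ~5.3 h,   ~$7    (1B params, 1B tokens)
print(a100_hours_and_cost(1e9, 200e9))  # ~1069 h,  ~$1390 (1B params, 200B tokens)
```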
And the following is an estimate by Daniel based on his runs of the 79M-parameter NEKO model, copied from his Discord message: I was able to get a Docker setup that both satisfies our original dependencies and has distributed training actually working with accelerate (switching to cudnn8-devel in the Docker mostly did the trick). I tested training at the full 79M params, 512 batch size, and 1024 context length on 4xA6000. I roughly tracked time (per grad update) and VRAM usage. I created both a guide and a video (unfortunately no narration, back home, people sleeping). First, here is the guide: https://docs.google.com/document/d/1W_dN3qarCOcLRDdEZ75LBtkLGiwUziWWDtVTjd43Ad4/edit?usp=sharing . With this setup, training Gato for 1M grad updates, with 512 batch size and 1024 context length, on 4xA6000 with fp16, using grad accumulation to hit the 512 batch size, would cost a total of $2748.67 and take about 65 days to train (a variable and rough estimate). The same specifications, but with only half the context length (512) and a slightly adjusted grad accumulation strategy, lead to $1030 and 24.3 days.
This isn't amazing, but it is a pretty good start, and there are many axes for improvement. Simply going from 4xA6000 to 8xA6000 on Vast is possible if offerings are available, so we could maybe hit a similar total cost in half the training time, which would be good. There are definitely several places where we can optimize our implementation through code changes. We should also look at Fully Sharded Data Parallel (FSDP), https://huggingface.co/blog/pytorch-fsdp , which is supported by accelerate and would lead to further gains; it was also used by LAION for their OpenFlamingo project. Also read through https://huggingface.co/docs/accelerate/index for other ideas.
And here is a video link showing how Daniel did the above: https://discord.com/channels/755517485096108153/1107862498541043762/1129660159615041547
The two estimates have a huge gap: Daniel's NEKO model has only 79M parameters and a number of tokens that is TBD (but surely much less than 200B), yet it has a higher estimated cost.
Assume training cost is proportional to the product of the two numbers P (number of parameters) and D (number of tokens), as mentioned in https://blog.eleuther.ai/transformer-math/. If we scale linearly from 79M to 1B parameters (our first MVP is about 1B parameters), and then also scale by the number of tokens, the training cost could reach into the hundreds of thousands of dollars or even more ($2748 * 1B / 79M is about $35K, and that is before any scaling by the number of tokens). I do not feel such a number sounds right.
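For transparency, here is the naive extrapolation written out. The 79M-parameter cost comes from Daniel's run; the target size and the token ratio are illustrative assumptions, since the token count of the NEKO run is still TBD:

```python
# Naive linear extrapolation of Daniel's measured 4xA6000 cost to a larger model.
# The base cost is from his run; the target size and token_ratio are assumptions
# used only for illustration.

NEKO_79M_COST_USD = 2748.67  # estimated cost for 1M grad updates at 79M params

def scaled_cost(target_params: float, base_params: float = 79e6,
                token_ratio: float = 1.0) -> float:
    """Scale the measured cost linearly in parameters (and optionally in tokens)."""
    return NEKO_79M_COST_USD * (target_params / base_params) * token_ratio

print(scaled_cost(1e9))                   # ~$34.8K: parameter scaling only
print(scaled_cost(1e9, token_ratio=10))   # ~$348K: hypothetical, if we also needed 10x more tokens
```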
Here is my first take on the possible reason for the gap (assuming the LLM estimate based on the formula tau * T = 6PD is correct; we will dig further to prove or disprove that):
The GATO paper has a statement “Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length L = 1024, which takes about 4 days”
https://cloud.google.com/tpu/docs/system-architecture-tpu-vm mentions:
TPU v3 Pod slices are available with 32, 128, 512, 1024, or 2048 TensorCores, and each TPU v3 chip contains two TensorCores. So we assume the "16x16 TPU v3 slice" from the original paper is a 512-TensorCore slice, i.e. 256 chips (hence "16x16").
Each TPU v3 chip supports 123 teraFLOPS (BF16) peak performance.
If our assumption in bullet 1 is correct, that slice supports 256 * 123 teraFLOPS (BF16), and the training lasted 4 days. Assuming an A100 supports 312 teraFLOPS (FP16), that is equivalent to about 100 A100s training for 4 days (256 * 123 / 312 ≈ 100.9), which translates to roughly 9600 A100-hours. If each A100-hour costs $1.3, the total training cost is about $12K.
The above-mentioned approach stays the same even if we need to modify some of our assumptions.
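Here is the same calculation written out so the assumptions are easy to swap; the TPU/A100 peak FLOPS and the $/hour price are the figures assumed above:

```python
# GATO-paper-based estimate: convert the reported TPU v3 training run into
# A100-equivalents, A100-hours, and an approximate dollar cost.

TPU_V3_CHIP_FLOPS = 123e12   # peak BF16 FLOP/s per TPU v3 chip
A100_FLOPS = 312e12          # peak FP16/BF16 FLOP/s per A100
A100_PRICE_PER_HOUR = 1.3    # assumed USD per A100-hour

n_chips = 256                # "16x16 TPU v3 slice" read as 256 chips (512 TensorCores)
train_days = 4               # training duration reported in the GATO paper

a100_equivalents = n_chips * TPU_V3_CHIP_FLOPS / A100_FLOPS   # ~100.9 A100s
a100_hours = a100_equivalents * train_days * 24               # ~9700 A100-hours
cost_usd = a100_hours * A100_PRICE_PER_HOUR                   # ~$12.6K

print(f"{a100_equivalents:.1f} A100-equivalents, {a100_hours:.0f} A100-hours, ${cost_usd:,.0f}")
```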
Basically, I think the number of floating-point operations is a universal measure, and an estimate based on the GATO paper should be considered more credible than estimates based on pure LLM models or on extrapolating from the 79M-parameter NEKO model: the former is a pure language model without multiple modalities, and the latter is still a work in progress.
Also, we need to account for some overhead to maintain distributed training, i.e. the total FLOPS of a group of A100s is not simply the FLOPS of one A100 multiplied by the number of A100s; there will be some discount or loss of performance. We also need to add some compute cost for developing the model before we can fully train it.
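One simple way to fold that overhead into the estimate is to divide by an assumed utilization factor. The 30-50% range below is a rough assumption for achieved versus peak FLOPS, not a measured number, and the $12.6K baseline is taken from the sketch above:

```python
# Adjust a peak-FLOPS cost estimate for real-world hardware utilization.
# The utilization values are assumptions; actual utilization depends on the
# implementation, interconnect, and batch/context sizes.

def adjusted_cost(peak_cost_usd: float, utilization: float) -> float:
    """Inflate a peak-FLOPS cost estimate by an assumed utilization factor."""
    return peak_cost_usd / utilization

for u in (0.5, 0.4, 0.3):
    print(f"utilization {u:.0%}: ~${adjusted_cost(12_600, u):,.0f}")
# ~$25K at 50%, ~$32K at 40%, ~$42K at 30% (plus development-phase compute)
```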
Issue was taken care of by Henry in its entirety back in July; leaving this open as the information needs an update given today's updated NEKO model and scope. @henryj18 I understand, after discussing with Harsh, that there was a Google doc that fully documented a lot of the work you did back in July; can you please comment below with a link to that doc? @harshsikka as a next step, we need to redefine the scope of this issue to match the updated NEKO model you have / are working on.
We need more compute & storage than is individually available to us via local GPUs to train the MVM outlined on our Roadmap:
To engage with any potential compute & data storage providers who may be interested, we should have a concrete understanding of precisely what we need and be able to make a corresponding ask.
Some ways to constrain this:
Outcome: Brief analysis of our needs in writeup form. This will then be used in potential proposals/pitches to compute partners.