ManifoldRG / NEKO

In-progress implementation of a GATO-style generalist multimodal model capable of image, text, RL, and robotics tasks
https://discord.gg/brsPnzNd8h
GNU General Public License v3.0

Resource Analysis #40

Open harshsikka opened 1 year ago

harshsikka commented 1 year ago

We need more compute and storage than is individually available to us via local GPUs to train the MVM outlined on our roadmap.

To engage with potential compute and data-storage providers who may be interested, we should have a concrete understanding of what we precisely need and be able to make a corresponding ask.

Some ways to constrain this:

Outcome: Brief analysis of our needs in writeup form. This will then be used in potential proposals/pitches to compute partners.

henryj18 commented 1 year ago

I am using the cost of training LLMs as a reference point for the GATO estimate, since GATO uses a similar transformer architecture to LLMs.

Refer to https://blog.eleuther.ai/transformer-math/ for cost estimation of training LLMs; in particular, I am using the equation tau * T = 6PD.

For an A100, tau = 312 teraFLOPS = 312 * 10^12 FLOPS with FP16 or BFLOAT16; refer to the A100 datasheet for the FLOPS figures: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf

Suppose our MVP needs a model of 1B parameters, i.e. P = 10^9, and suppose we are training on 1B tokens (although that post recommends no less than 200B tokens for a serious LLM). Let's use 1B tokens for the calculation and then scale as needed, so assume D = 10^9.

Going through the equation to calculate T: 312 * 10^12 * T = 6 * 10^9 * 10^9, so T is about 19,230 seconds, which translates to one (1) A100 training for about 5.34 hours. If we instead train on 200B tokens, it will be about 1,100 hours on one (1) A100, and we can scale accordingly.

Suppose A100 GPU time costs $1.3 per hour; then the total cost would be about $1,400-$1,500. That sounds very low. Is it feasible? Are we missing anything?
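
A minimal Python sketch of this back-of-envelope calculation (the $1.3 per A100-hour price is the same assumption as above; everything else follows the tau * T = 6PD rule from the transformer-math post):

```python
# Back-of-envelope training cost from the compute rule C ~= tau * T = 6 * P * D
# (https://blog.eleuther.ai/transformer-math/). The GPU price is an assumption.

A100_PEAK_FLOPS = 312e12       # peak FP16/BF16 throughput of one A100, FLOPS
PRICE_PER_A100_HOUR = 1.3      # assumed rental price, USD

def a100_hours(params: float, tokens: float) -> float:
    """Ideal single-A100 training time in hours for a dense transformer."""
    total_flops = 6 * params * tokens            # C = 6 * P * D
    return total_flops / A100_PEAK_FLOPS / 3600  # seconds -> hours

for tokens in (1e9, 200e9):
    hours = a100_hours(params=1e9, tokens=tokens)
    print(f"P=1B, D={tokens:.0e} tokens: {hours:,.1f} A100-hours, "
          f"~${hours * PRICE_PER_A100_HOUR:,.0f}")
# P=1B, D=1e+09 tokens: 5.3 A100-hours, ~$7
# P=1B, D=2e+11 tokens: 1,068.4 A100-hours, ~$1,389
```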

henryj18 commented 1 year ago

And the following is an estimate by Daniel based on his runs of the 79M-parameter NEKO model (copied and pasted from his Discord message):

I was able to get a Docker setup that both satisfies our original dependencies and has distributed training actually working with accelerate (switching to cudnn8-devel in the Docker image mostly did the trick). I tested training at the full 79M parameters, 512 batch size, and 1024 context length on 4xA6000, roughly tracking time per gradient update and VRAM usage. I created both a guide and a video (unfortunately no narration; back home, people sleeping). First, here is the guide: https://docs.google.com/document/d/1W_dN3qarCOcLRDdEZ75LBtkLGiwUziWWDtVTjd43Ad4/edit?usp=sharing . With this setup, training Gato for 1M gradient updates, with 512 batch size and 1024 context length, on 4xA6000 with fp16, using gradient accumulation to hit the 512 batch size, would cost a total of $2,748.67 and take about 64.8 days to train (a variable and rough estimate). The same specification, but with only half the context length (512) and a slightly adjusted gradient-accumulation strategy, comes to about $1,030 and 24.3 days.
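
For reference, a rough sketch of the arithmetic behind figures like those quoted above; the seconds-per-update and 4xA6000 hourly rate below are hypothetical placeholders chosen to land near the quoted numbers, not Daniel's measured values:

```python
# Extrapolate total training time and cost from a measured time per gradient
# update. Both constants below are assumed placeholders, not measurements.

SECONDS_PER_UPDATE = 5.6        # hypothetical wall-clock time per grad update (s)
RATE_4X_A6000_PER_HOUR = 1.77   # hypothetical rental rate for 4xA6000, USD/hour
TOTAL_UPDATES = 1_000_000       # 1M gradient updates, as in the GATO recipe

hours = TOTAL_UPDATES * SECONDS_PER_UPDATE / 3600
print(f"~{hours / 24:.1f} days, ~${hours * RATE_4X_A6000_PER_HOUR:,.0f}")
# ~64.8 days, ~$2,753
```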

This isn't amazing, but it is a pretty good start, and there are many axes for improvement. Simply going from 4xA6000 to 8xA6000 on vast is possible if offerings are available, so we could maybe hit similar total cost at half the training time, which would be good. There are definitely several places where we can optimize our implementation through code changes. We should also look at Fully Sharded Data Parallel (FSDP), https://huggingface.co/blog/pytorch-fsdp , which is supported by accelerate and should lead to further gains; it was also used by LAION for their openflamingo project. Also read through https://huggingface.co/docs/accelerate/index for other ideas.
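
On the FSDP suggestion, here is a minimal sketch of how it could be wired up through accelerate, intended to be launched with `accelerate launch` across the available GPUs. The model, optimizer, and dataloader below are toy stand-ins for the real NEKO setup, and the exact plugin options vary by accelerate version:

```python
# Sketch of enabling Fully Sharded Data Parallel via Hugging Face accelerate.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin()   # default sharding settings
accelerator = Accelerator(mixed_precision="fp16", fsdp_plugin=fsdp_plugin)

# Toy stand-ins for the real NEKO model, optimizer, and dataloader.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 512), torch.randn(1024, 1))
train_loader = DataLoader(dataset, batch_size=64)

# prepare() wraps the model for sharded training and moves data to the right device.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for inputs, targets in train_loader:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)   # accelerate handles scaling/sharding details
    optimizer.step()
    optimizer.zero_grad()
```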

And here is a video link about how Daniel did the above-mentioned https://discord.com/channels/755517485096108153/1107862498541043762/1129660159615041547

henryj18 commented 1 year ago

The two estimates have a huge gap: Daniel's NEKO model has only 79M parameters, and its number of tokens is TBD (but I would guess definitely much less than 200B), yet it has a higher estimated cost.

Assume training cost is proportional to the product of P (number of parameters) and D (number of tokens), as mentioned in https://blog.eleuther.ai/transformer-math/. If we scale linearly from 79M to 1B parameters (our first MVP is about 1B parameters), 2748 * 1B/79M is about $35K before any scaling by token count; if we also scale up the number of tokens, the training cost could reach $300K-$1M or even more. I do not feel such a number sounds right.
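
A tiny sketch of that naive extrapolation, treating cost as proportional to P * D (Daniel's token count is unknown, so the token ratio is left as an illustrative variable):

```python
# Naive extrapolation assuming cost scales with P * D. The 10x token ratio in
# the second print is purely illustrative, not a known number for our datasets.

MEASURED_COST = 2748.67          # USD, quoted 79M-parameter estimate above
P_OLD, P_NEW = 79e6, 1e9

def extrapolated_cost(token_ratio: float = 1.0) -> float:
    """token_ratio = D_new / D_old; 1.0 means parameter-only scaling."""
    return MEASURED_COST * (P_NEW / P_OLD) * token_ratio

print(f"params only:     ~${extrapolated_cost():,.0f}")    # ~$34,793
print(f"10x more tokens: ~${extrapolated_cost(10):,.0f}")  # ~$347,933
```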

Here is my first take on the possible reasons (assuming the LLM estimate based on the formula tau * T = 6PD is correct; we will dig further to prove or disprove that):

  1. Game/proprioception models (trained like an LLM, with tokenization) need much more training resources than real LLM models
  2. Our current NEKO model needs to be further optimized for resource usage
  3. There is some fixed resource requirement that is not proportional to model size and number of tokens. Because of that, training a small model like our current 79M GATO may spend much more on this fixed overhead than on the training loop itself, whereas for huge models the fixed requirement is dwarfed by the resources consumed by the training loop

henryj18 commented 1 year ago

The GATO paper has a statement “Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length L = 1024, which takes about 4 days”

https://cloud.google.com/tpu/docs/system-architecture-tpu-vm mentions:

  1. TPU v3 Pod slices are available with 32, 128, 512, 1024, or 2048 TensorCores, and each TPU v3 chip contains two TensorCores. So we assume the "16x16 TPU v3 slice" from the paper is a 512-TensorCore slice, i.e. 256 chips (16 x 16 = 256, hence "16x16").

  2. Each TPU v3 chip supports 123 teraFLOPS (BF16) peak performance.

If our assumption in bullet 1 is correct, that slice supports 256 * 123 teraFLOPS (BF16), and the training lasted 4 days. Assuming an A100 supports 312 teraFLOPS (FP16/BF16), that is equivalent to about 100 A100s training for 4 days (256 * 123 / 312 is about 100.9), which translates to roughly 9,600-9,700 A100-hours; at $1.3 per A100-hour, the total training cost is about $12.5K.
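
A small sketch of that conversion (the $1.3 per A100-hour price is the same assumption as in the earlier estimate):

```python
# Convert the GATO paper's "16x16 TPU v3 slice for ~4 days" into A100-hours
# by peak-FLOPS ratio, following the reasoning above.

TPU_V3_CHIP_FLOPS = 123e12    # peak BF16 FLOPS per TPU v3 chip
A100_PEAK_FLOPS = 312e12      # peak FP16/BF16 FLOPS per A100
N_CHIPS = 16 * 16             # 256 chips = 512 TensorCores
DAYS = 4
PRICE_PER_A100_HOUR = 1.3     # assumed rental price, USD

equivalent_a100s = N_CHIPS * TPU_V3_CHIP_FLOPS / A100_PEAK_FLOPS   # ~100.9
a100_hours = equivalent_a100s * DAYS * 24
print(f"~{equivalent_a100s:.1f} A100s x {DAYS} days = {a100_hours:,.0f} A100-hours, "
      f"~${a100_hours * PRICE_PER_A100_HOUR:,.0f}")
# ~100.9 A100s x 4 days = 9,689 A100-hours, ~$12,595
```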

The above-mentioned approach stays the same even if we need to modify some of our assumptions.

Basically, I think the number of floating-point operations is a universal measure, and an estimate based on the GATO paper should be considered more credible than those based on a pure LLM cost model or on extrapolating from the 79M NEKO model: the former covers pure language models without the multiple modalities, and the latter is still a work in progress.

henryj18 commented 1 year ago

Also, we need to consider some overhead for maintaining distributed training, i.e. the total FLOPS of a group of A100s is not simply the FLOPS per A100 multiplied by the number of A100s; there will be some discount or loss of performance. We also need to add some compute cost for developing the model before we can fully train it.
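
One simple way to fold that into the estimates is to apply an assumed utilization factor to any ideal peak-FLOPS number; both inputs below are illustrative assumptions, not measurements of our setup:

```python
# Discount an ideal peak-FLOPS estimate by an assumed hardware utilization factor.
# 1,068 A100-hours is the earlier 1B-param / 200B-token formula estimate, and 0.4
# is an assumed fraction of peak FLOPS actually achieved in practice.

IDEAL_A100_HOURS = 1_068
UTILIZATION = 0.4
PRICE_PER_A100_HOUR = 1.3

realistic_hours = IDEAL_A100_HOURS / UTILIZATION
print(f"{realistic_hours:,.0f} A100-hours, ~${realistic_hours * PRICE_PER_A100_HOUR:,.0f}")
# 2,670 A100-hours, ~$3,471
```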

BobakBagheri commented 11 months ago

This issue was taken care of by Henry in its entirety back in July; I am leaving it open because the information needs an update given today's updated NEKO model and scope. @henryj18 I understand after discussing with Harsh that there was a Google doc that fully documented a lot of the work you did back in July; can you please comment below with that doc? @harshsikka as a next step, we need to redefine the scope of this issue to match the updated NEKO model you have or are working on.