bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

TF-Multi Node Training Layout #4

Closed · harm-devries closed this 1 year ago

harm-devries commented 2 years ago

For model training it is important that we achieve high throughput. We need to determine the multi-node training configuration (i.e., which combination of data, tensor, and pipeline parallelism) for 192 V100 GPUs.
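
A minimal sketch of the search space, not from the issue itself: it assumes Megatron-LM's convention that world size = tensor-parallel × pipeline-parallel × data-parallel degree, and assumes 8 V100s per node (the node size here is a guess). Tensor parallelism is constrained to fit within a node, since it is the most communication-intensive of the three and benefits from intra-node NVLink bandwidth.

```python
# Hypothetical sketch: enumerate candidate parallelism layouts for a
# fixed GPU budget, under the (assumed) Megatron-LM decomposition
#   world_size = tensor_parallel * pipeline_parallel * data_parallel.

WORLD_SIZE = 192    # total V100 GPUs from the issue
GPUS_PER_NODE = 8   # assumption: node size is not stated in the issue

def candidate_layouts(world_size=WORLD_SIZE, gpus_per_node=GPUS_PER_NODE):
    """Yield (tensor, pipeline, data) degrees whose product is world_size,
    with the tensor-parallel degree kept within a single node."""
    for tp in range(1, gpus_per_node + 1):
        if gpus_per_node % tp or world_size % tp:
            continue
        rest = world_size // tp
        for pp in range(1, rest + 1):
            if rest % pp:
                continue
            yield tp, pp, rest // pp

if __name__ == "__main__":
    for tp, pp, dp in candidate_layouts():
        print(f"tensor={tp:2d}  pipeline={pp:3d}  data={dp:3d}")
```

The remaining work the issue describes is empirical: benchmarking the throughput of these candidate layouts to find the best one for the 192-GPU cluster.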