JF-D / Proteus


Discrepancy in predicted runtimes for some configs #3

Open hakesh729 opened 1 month ago

hakesh729 commented 1 month ago

Context: We are trying to get Proteus runtime predictions for the GPT3-2.7B model on a 32-GPU V100 cluster (4 nodes with 8 GPUs per node), across configurations that vary the PP degree, TP degree (aka MP degree), and ZeRO. We also made some changes to support a higher PP degree in the PP strategy, used macro-batches as micro-batches, etc. More details about our changes will be given in later comments.
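For reference, a minimal sketch of the degree constraint we expect each configuration to satisfy on 32 GPUs (the (TP, PP) pairs below are illustrative placeholders; the real values are in examples/fail_config1.sh and examples/fail_config2.sh):

```python
# Sanity-check that a (TP, PP) configuration covers the cluster.
WORLD_SIZE = 32  # 4 nodes x 8 V100s

def implied_dp(tp: int, pp: int, world_size: int = WORLD_SIZE) -> int:
    """Return the implied data-parallel degree, or raise if it isn't integral."""
    assert world_size % (tp * pp) == 0, (
        f"TP={tp} x PP={pp} does not divide world size {world_size}"
    )
    return world_size // (tp * pp)

for tp, pp in [(1, 8), (2, 4), (4, 8), (8, 4)]:  # placeholder pairs
    print(f"TP={tp} PP={pp} -> DP={implied_dp(tp, pp)}")
```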

We also added the relevant H100 topo files to test Proteus with H100; please ignore those files and the other external/nccl file changes.

With these changes to the megatron_gpt code, we observe up to a 10x gap between the predicted time and the actual runtime. The megatron_gpt python scripts specifying the config details are in examples/fail_config1.sh and examples/fail_config2.sh. A screenshot of the observed result for one of these configs is attached:

[screenshot: observed vs. predicted runtime for one of the failing configs]
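For concreteness, a sketch of the comparison we are doing (the numbers below are placeholders, not the values in the screenshot):

```python
# Flag configs where Proteus' prediction is far from the measured iteration
# time. All numbers are placeholders for illustration only.
runs = {
    "fail_config1": {"predicted_ms": 120.0, "measured_ms": 1200.0},
    "fail_config2": {"predicted_ms": 150.0, "measured_ms": 1400.0},
}
for name, r in runs.items():
    ratio = r["measured_ms"] / r["predicted_ms"]
    status = "OK" if 0.8 <= ratio <= 1.25 else "DISCREPANCY"
    print(f"{name}: measured/predicted = {ratio:.1f}x [{status}]")
```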
AgrawalAmey commented 1 month ago

@JF-D Hakesh has been trying to obtain Proteus predictions for our paper. We have incorporated the fixes discussed over email. The only additional changes here are to support pipeline parallelism (with virtual stages) and micro-batching. Please share any inputs you have; we want to make sure our changes are not introducing errors. Thank you!

JF-D commented 1 month ago

Have you checked the dumped trace? Proteus exports a trace that can be visualized in Chrome tracing when profile=True is set (here). From the trace you can check whether the partitioning and scheduling are as you expected.
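If it helps, here is a minimal sketch for inspecting such a trace offline without opening chrome://tracing, assuming the standard Chrome-tracing JSON layout (the file name is hypothetical):

```python
import json
from collections import defaultdict

# Summarize a Chrome-tracing trace, e.g. the one Proteus dumps with
# profile=True. "proteus_trace.json" is a placeholder file name.
with open("proteus_trace.json") as f:
    trace = json.load(f)

# Chrome traces are either a bare event list or {"traceEvents": [...]}.
events = trace["traceEvents"] if isinstance(trace, dict) else trace

# Total duration per event name for complete ("X") events; ts/dur are in us.
totals = defaultdict(float)
for ev in events:
    if ev.get("ph") == "X":
        totals[ev.get("name", "?")] += ev.get("dur", 0)

for name, us in sorted(totals.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{us / 1e3:10.2f} ms  {name}")
```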

Could you also provide a copy of the cost model profiling result? I don't have access to V100 or H100 GPUs currently.