What's the optimal parallel strategy using TensorRT-LLM?

databricks / dbrx

Code examples and resources for DBRX, a large language model developed by Databricks

https://www.databricks.com/

What's the optimal parallel strategy using TensorRT-LLM? #8

Open iteratorlee opened 3 months ago

iteratorlee commented 3 months ago

First, thanks for your great work. I read the PR you opened in the TensorRT-LLM repo and noticed that EP + TP, PP + TP, and TP are supported during inference. May I ask which one is optimal? Specifically, for the MoE layer, does EP or TP yield better performance?

hanlint commented 3 months ago

cc: @megha95

dskhudia commented 3 months ago

TP is better: at lower batch sizes it allows better load balance. At higher batch sizes, TP and EP should be similar. We haven't benchmarked EP yet.
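
To illustrate the load-balance point, here is a toy simulation (not a TensorRT-LLM benchmark). It assumes DBRX-style routing with 16 experts and top-4 selection, 8 GPUs, and uniformly random routing, which is a simplification since real routers are learned. Under EP each GPU owns whole experts, so its load depends on which experts the tokens pick; under TP every expert is sharded across all GPUs, so the work is modeled as evenly split. Communication and memory-bandwidth costs are ignored, and the function names are hypothetical.

```python
import random


def ep_imbalance(batch_size, num_experts=16, top_k=4, num_gpus=8,
                 trials=200, seed=0):
    """Average ratio of busiest-GPU load to mean GPU load under expert parallelism.

    Assumption: each token picks top_k distinct experts uniformly at random.
    """
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus
    mean_load = batch_size * top_k / num_gpus
    total = 0.0
    for _ in range(trials):
        load = [0] * num_gpus
        for _ in range(batch_size):
            # Each selected expert adds one unit of work to the GPU that owns it.
            for expert in rng.sample(range(num_experts), top_k):
                load[expert // experts_per_gpu] += 1
        total += max(load) / mean_load
    return total / trials


def tp_imbalance(batch_size):
    """Under tensor parallelism every GPU holds a slice of every expert,
    so this toy model treats the load as perfectly balanced."""
    return 1.0


if __name__ == "__main__":
    for bs in (1, 4, 16, 64, 256):
        print(f"batch={bs:4d}  EP max/mean={ep_imbalance(bs):.2f}  "
              f"TP max/mean={tp_imbalance(bs):.2f}")
```

In this model a ratio of 1.0 means every GPU does the same amount of work. With a single token, at most 4 of the 8 GPUs are busy under EP, so the ratio is at least 2; as the batch grows, the ratio drifts toward 1, which matches the expectation that the two strategies converge at higher batch sizes.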