NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How does auto_parallel work? #1471

Open Hudayday opened 5 months ago

Hudayday commented 5 months ago

Hi, I just noticed that auto_parallel support is available. However, after reviewing the code, I still have no idea how it determines the best configuration. It seems to solve the resharding cost graph as an ILP problem to find the lowest cost.

Is this correct? Does it apply to all models?

yuxianq commented 5 months ago

Yes, the AutoPP implementation is based on Alpa; you can refer to Alpa for more details on how auto parallelization is modeled as an ILP problem.
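To make the ILP formulation concrete, here is a minimal, self-contained sketch (not TensorRT-LLM code) of the Alpa-style search: each layer picks one sharding strategy, and the objective is the sum of per-layer compute costs plus resharding (communication) costs between consecutive layers. All cost numbers and strategy names are hypothetical, and a real ILP solver is replaced by brute-force enumeration for clarity.

```python
# Illustrative sketch of Alpa-style auto-parallel strategy search.
# Strategies per layer: 0 = "row-parallel", 1 = "column-parallel"
# (names and costs are made up for illustration).
from itertools import product

# Hypothetical compute cost of each layer under each strategy.
layer_costs = [
    [4.0, 5.0],   # layer 0
    [6.0, 3.0],   # layer 1
    [2.0, 2.5],   # layer 2
]

# Resharding cost reshard[a][b]: converting layer i's output layout `a`
# into the layout `b` expected by layer i+1. Zero when layouts match.
reshard = [
    [0.0, 2.0],
    [2.0, 0.0],
]

def best_assignment(layer_costs, reshard):
    """Enumerate all strategy assignments; return (min_cost, assignment).

    A real system would hand this objective to an ILP solver instead of
    enumerating, but the objective being minimized is the same.
    """
    n = len(layer_costs)
    best_cost, best_assign = float("inf"), None
    for assign in product(range(len(reshard)), repeat=n):
        cost = sum(layer_costs[i][assign[i]] for i in range(n))
        cost += sum(reshard[assign[i]][assign[i + 1]] for i in range(n - 1))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign

cost, assign = best_assignment(layer_costs, reshard)
print(cost, assign)  # → 10.5 (1, 1, 1)
```

Note the result: even though layer 0 is cheaper row-parallel in isolation, the minimum-cost plan keeps all three layers column-parallel, because avoiding resharding between layers outweighs the per-layer saving. That coupling between adjacent layers is exactly why the problem is solved globally rather than greedily per layer.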

> Does it apply to all models?

You can find all supported native ops in LAYER_TYPE_2_NODE_TYPE and all supported plugins in PLUGIN_LAYER_TYPE_2_NODE_TYPE in tensorrt_llm/auto_parallel/node_graph.py. Any model that uses only layers from these tables can enable auto parallel. For example, LLaMA and GPT are supported, but Mixtral is not, since the MoE plugin is not included. We will add cost models for more plugins in future versions.
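The coverage check described above amounts to a set-membership test over the model's layer types. Here is a hedged sketch of that idea; the table contents below are hypothetical placeholders, not the actual entries of LAYER_TYPE_2_NODE_TYPE or PLUGIN_LAYER_TYPE_2_NODE_TYPE.

```python
# Illustrative sketch (not the real TensorRT-LLM tables): a model can
# enable auto parallel only if every one of its layer types appears in
# the native-op table or the plugin table.
SUPPORTED_NATIVE_LAYERS = {"MATRIX_MULTIPLY", "ELEMENTWISE", "SOFTMAX"}  # placeholder entries
SUPPORTED_PLUGIN_LAYERS = {"GPTAttentionPlugin"}                          # placeholder entries

def unsupported_layers(model_layer_types):
    """Return the layer types that would block auto parallel for this model."""
    supported = SUPPORTED_NATIVE_LAYERS | SUPPORTED_PLUGIN_LAYERS
    return [t for t in model_layer_types if t not in supported]

# A network using only covered layers passes the check:
print(unsupported_layers(["MATRIX_MULTIPLY", "GPTAttentionPlugin"]))  # → []
# A layer with no cost model (here, a made-up MoE plugin name) is flagged:
print(unsupported_layers(["MixtureOfExpertsPlugin"]))  # → ['MixtureOfExpertsPlugin']
```

This mirrors why Mixtral is rejected: one uncovered layer (the MoE plugin, which has no cost model yet) is enough to block the whole model.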