NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How does auto_parallel work? #1471

Open Hudayday opened 5 months ago

Hudayday commented 5 months ago

Hi, I just noticed that auto_parallel support is available. However, after reviewing the code, I still have no idea how it determines the best configuration. It seems to solve the resharding cost graph as an ILP problem to find the lowest cost.

Is this correct? Does it apply to all models?

yuxianq commented 5 months ago

Yes, the AutoPP implementation is based on Alpa; you can refer to Alpa for more details on how auto parallelization is modeled as an ILP problem.
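To make the ILP formulation concrete, here is a minimal, self-contained sketch (not TensorRT-LLM code) of the Alpa-style search: each layer picks one sharding strategy, and the objective is the sum of per-layer compute costs plus resharding (communication) costs between consecutive layers. All cost numbers and strategy names are hypothetical, and a real ILP solver is replaced by brute-force enumeration for clarity.

```python
# Illustrative sketch of Alpa-style auto-parallel strategy search.
# Strategies per layer: 0 = "row-parallel", 1 = "column-parallel"
# (names and costs are made up for illustration).
from itertools import product

# Hypothetical compute cost of each layer under each strategy.
layer_costs = [
    [4.0, 5.0],   # layer 0
    [6.0, 3.0],   # layer 1
    [2.0, 2.5],   # layer 2
]

# Resharding cost reshard[a][b]: converting layer i's output layout `a`
# into the layout `b` expected by layer i+1. Zero when layouts match.
reshard = [
    [0.0, 2.0],
    [2.0, 0.0],
]

def best_assignment(layer_costs, reshard):
    """Enumerate all strategy assignments; return (min_cost, assignment).

    A real system would hand this objective to an ILP solver instead of
    enumerating, but the objective being minimized is the same.
    """
    n = len(layer_costs)
    best_cost, best_assign = float("inf"), None
    for assign in product(range(len(reshard)), repeat=n):
        cost = sum(layer_costs[i][assign[i]] for i in range(n))
        cost += sum(reshard[assign[i]][assign[i + 1]] for i in range(n - 1))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign

cost, assign = best_assignment(layer_costs, reshard)
print(cost, assign)  # → 10.5 (1, 1, 1)
```

Note the result: even though layer 0 is cheaper row-parallel in isolation, the minimum-cost plan keeps all three layers column-parallel, because avoiding resharding between layers outweighs the per-layer saving. That coupling between adjacent layers is exactly why the problem is solved globally rather than greedily per layer.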

> Does it apply to all models?

You can find all supported native ops in LAYER_TYPE_2_NODE_TYPE and all supported plugins in PLUGIN_LAYER_TYPE_2_NODE_TYPE in tensorrt_llm/auto_parallel/node_graph.py. Any model that uses only layers from these tables can enable auto parallel. For example, LLaMA and GPT are supported, but Mixtral is not, since the MoE plugin is not included. We will add cost models for more plugins in future versions.
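The coverage check described above amounts to a set-membership test over the model's layer types. Here is a hedged sketch of that idea; the table contents below are hypothetical placeholders, not the actual entries of LAYER_TYPE_2_NODE_TYPE or PLUGIN_LAYER_TYPE_2_NODE_TYPE.

```python
# Illustrative sketch (not the real TensorRT-LLM tables): a model can
# enable auto parallel only if every one of its layer types appears in
# the native-op table or the plugin table.
SUPPORTED_NATIVE_LAYERS = {"MATRIX_MULTIPLY", "ELEMENTWISE", "SOFTMAX"}  # placeholder entries
SUPPORTED_PLUGIN_LAYERS = {"GPTAttentionPlugin"}                          # placeholder entries

def unsupported_layers(model_layer_types):
    """Return the layer types that would block auto parallel for this model."""
    supported = SUPPORTED_NATIVE_LAYERS | SUPPORTED_PLUGIN_LAYERS
    return [t for t in model_layer_types if t not in supported]

# A network using only covered layers passes the check:
print(unsupported_layers(["MATRIX_MULTIPLY", "GPTAttentionPlugin"]))  # → []
# A layer with no cost model (here, a made-up MoE plugin name) is flagged:
print(unsupported_layers(["MixtureOfExpertsPlugin"]))  # → ['MixtureOfExpertsPlugin']
```

This mirrors why Mixtral is rejected: one uncovered layer (the MoE plugin, which has no cost model yet) is enough to block the whole model.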