NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
7.34k stars · 794 forks

[Model Requests] Add support for GLM-4 series #1828

Open HLSS-Hen opened 4 days ago

HLSS-Hen commented 4 days ago

GLM-4 and GLM-4V are the next-generation models of ChatGLM3 and CogVLM2; the model repository is here: https://github.com/THUDM/GLM-4/

The GLM-4 model is very similar to ChatGLM3, so only a slight modification should be needed: https://github.com/THUDM/GLM-4/issues/132#issuecomment-2178031221
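Given how close GLM-4 is to ChatGLM3, one plausible interim path would be to reuse the existing ChatGLM example workflow in TensorRT-LLM. This is a sketch only: the GLM-4 model directory and whether `convert_checkpoint.py` accepts GLM-4 checkpoints are assumptions, since official support does not exist yet.

```shell
# Hypothetical: adapt TensorRT-LLM's ChatGLM example (examples/chatglm)
# to GLM-4 weights. The model name and flag support are assumptions.
cd examples/chatglm
python convert_checkpoint.py \
    --model_dir THUDM/glm-4-9b-chat \
    --output_dir ./glm4_ckpt \
    --dtype float16
trtllm-build \
    --checkpoint_dir ./glm4_ckpt \
    --output_dir ./glm4_engine \
    --gemm_plugin float16
```

This is a build-command fragment requiring GPU hardware and downloaded model weights, so it cannot be asserted in isolation.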

The GLM-4V model is similar to CogVLM2 (https://github.com/NVIDIA/TensorRT-LLM/issues/1644): just replace the language backbone with GLM-4 and remove the visual experts. It has better performance and even better accuracy.
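To make the "remove the visual experts" point concrete, here is an illustrative toy sketch (NumPy, not TensorRT-LLM code; all function names are hypothetical) of the structural difference: CogVLM-style layers route text and image tokens through separate weight matrices, while a GLM-4V-style layer uses one shared projection for all tokens.

```python
import numpy as np

def cogvlm_style_proj(x, token_types, w_lang, w_vis):
    # CogVLM-style "visual expert": route each token through a
    # modality-specific weight matrix (0 = text token, 1 = image token).
    out = np.empty((x.shape[0], w_lang.shape[1]))
    text = token_types == 0
    out[text] = x[text] @ w_lang
    out[~text] = x[~text] @ w_vis
    return out

def glm4v_style_proj(x, w_lang):
    # GLM-4V-style: a single shared projection for every token;
    # the separate visual-expert weights are gone.
    return x @ w_lang

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal((6, d))
w_lang = rng.standard_normal((d, d))
w_vis = rng.standard_normal((d, d))
types = np.array([0, 0, 1, 1, 0, 1])

moe_out = cogvlm_style_proj(x, types, w_lang, w_vis)
dense_out = glm4v_style_proj(x, w_lang)

# Text tokens are identical under both schemes; image tokens now
# also flow through the language weights.
assert np.allclose(moe_out[types == 0], dense_out[types == 0])
assert not np.allclose(moe_out[types == 1], dense_out[types == 1])
```

From a TensorRT-LLM perspective, dropping the expert branch removes the per-token routing, which is part of why a GLM-4-backed model should be simpler to support than CogVLM2.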

Please add official support; I believe TensorRT's backing makes it the better choice for CUDA devices.

cc @ncomly-nvidia

nv-guomingz commented 4 days ago
syuoni commented 2 days ago

I will take a look at this. Thanks!