Prerequisites
[X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
[X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Tensor parallelism is a critical technique used to train and run inference on very large language models by splitting the underlying computations/tensors across multiple compute devices.
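To make the idea concrete, here is a minimal NumPy sketch of a column-parallel matmul, the basic building block TP applies to attention and MLP weights. This is illustrative only, not the proposed implementation: real TP shards weights across processes/accelerators and reassembles outputs with collectives (e.g. all-gather), which `np.concatenate` stands in for here.

```python
# Illustrative sketch of column-parallel tensor parallelism.
# "Devices" are simulated with NumPy array shards; names/shapes are hypothetical.
import numpy as np

def column_parallel_matmul(x, w, tp_size):
    """Split the weight matrix column-wise across `tp_size` devices.

    Each "device" computes x @ w_shard independently; an all-gather
    (here: np.concatenate) reassembles the full output.
    """
    shards = np.split(w, tp_size, axis=1)               # one weight shard per device
    partials = [x @ shard for shard in shards]          # runs in parallel on real hardware
    return np.concatenate(partials, axis=1)             # stand-in for all-gather

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))      # activations: batch x hidden
w = rng.standard_normal((512, 2048))   # weights: hidden x 4*hidden

out_tp = column_parallel_matmul(x, w, tp_size=4)
assert np.allclose(out_tp, x @ w)      # TP result matches the single-device result
```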
Motivation
In our previous implementation on Xeon CPUs, tensor parallelism (TP) significantly reduces inference latency.
model | precision | TP size | input_size | next_token_time/ms
-- | -- | -- | -- | --
llama2-70b | q4_j | 1 | 32 | 191.91
llama2-70b | q4_j | 2 | 32 | 120.87
llama2-70b | q4_j | 4 | 32 | 86.15
llama2-70b | q4_j | 1 | 1024 | 197.18
llama2-70b | q4_j | 2 | 1024 | 129.25
llama2-70b | q4_j | 4 | 1024 | 91.76
llama2-70b | q4_j | 1 | 2012 | 204.85
llama2-70b | q4_j | 2 | 2012 | 127.31
llama2-70b | q4_j | 4 | 2012 | 100.44