-
### System Info
- CPU architecture: x86_64
- CPU/Host memory size: 250GB total
- GPU properties
  - GPU name: 2x NVIDIA A100 80GB
  - GPU memory size: 160GB total
- Libraries
  - tensorrt @ fi…
-
I'm trying to quantize a TF-TRT INT8 model in Colab-TF-TRT-inference-from-Keras-saved-model.ipynb using a Jupyter notebook.
I hit a GPU out-of-memory error, but I think I have enough GPU memory:
~~~
…
~~~
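For reference, this is roughly the TF-TRT INT8 conversion path that notebook follows, using the public `TrtGraphConverterV2` API; the SavedModel path and the calibration iterable are placeholders for your own:

~~~python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def calibration_input_fn():
    # Hypothetical iterable: yield a handful of representative batches
    # matching the model's input signature.
    for batch in calibration_batches:
        yield (batch,)

params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.INT8)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="keras_saved_model",   # placeholder path
    conversion_params=params)
converter.convert(calibration_input_fn=calibration_input_fn)
converter.save("tftrt_int8_saved_model")
~~~

Note that INT8 calibration builds TensorRT engines on the GPU, so it allocates a TensorRT workspace on top of TensorFlow's own memory pool; OOM can therefore occur during conversion even when the final engine would fit. `TrtConversionParams(max_workspace_size_bytes=...)` caps that workspace.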
-
### System Info
- GPU: 2xA100-40G
- TensorRT-LLM v0.8.0
### Who can help?
@Tracin
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officia…
-
Hi @tridao, we recently implemented INT8 forward FMHA (8-bit Flash-Attention) with both static and dynamic quantization for Softmax on our GPGPU card, and achieved good results and relatively okay acc…
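For readers unfamiliar with the two schemes, here is a rough PyTorch sketch of the difference between static and dynamic INT8 quantization of the softmax output (per-tensor symmetric scales; the names are ours, not from the FMHA kernel):

~~~python
import torch

def softmax_int8(scores: torch.Tensor, static: bool = False):
    """Quantize softmax probabilities to INT8.

    Static:  scale is fixed ahead of time; since softmax outputs lie
             in [0, 1], scale = 1/127 always covers the full range.
    Dynamic: scale is derived from the actual max of this tensor, so
             small probabilities keep more resolution.
    """
    p = torch.softmax(scores, dim=-1)
    scale = torch.tensor(1.0 / 127) if static else p.amax() / 127
    q = torch.clamp((p / scale).round(), -128, 127).to(torch.int8)
    return q, scale   # dequantize later as q.float() * scale
~~~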
-
We plan to add QAT for LLMs to torchao (as mentioned in the original RFC: https://github.com/pytorch-labs/ao/issues/47).
For this to run efficiently on the GPU we'd need kernel support for W4A8…
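To make the W4A8 requirement concrete, here is a hedged fake-quantization sketch of what a QAT-style W4A8 linear computes in float; a real kernel would keep weights packed in INT4 and activations in INT8, and the helper names here are hypothetical:

~~~python
import torch

def fake_quant(x: torch.Tensor, n_bits: int, per_channel: bool = False):
    # Symmetric quantize-dequantize in float; real QAT would route
    # gradients through round() with a straight-through estimator.
    qmax = 2 ** (n_bits - 1) - 1
    amax = x.abs().amax(dim=1, keepdim=True) if per_channel else x.abs().amax()
    scale = (amax / qmax).clamp(min=1e-8)
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def w4a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """W4A8: INT8 per-tensor activations, INT4 per-output-channel weights."""
    return fake_quant(x, 8) @ fake_quant(weight, 4, per_channel=True).t()
~~~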
-
The model was downloaded from https://github.com/fatihcakirs/mobile_models/blob/main/v0_7/tflite/mobilebert_int8_384_20200602.tflite
Some fully-connected weights have a non-zero zero point (e.g. weight `b…
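A quick way to reproduce the observation is to walk the model's tensors with the TFLite interpreter and flag INT8 weights whose zero point is non-zero (per the TFLite quantization spec, conv/fully-connected weight tensors are expected to be symmetric, i.e. `zero_point == 0`):

~~~python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(
    model_path="mobilebert_int8_384_20200602.tflite")
interpreter.allocate_tensors()

# List INT8 tensors whose quantization zero points are not all zero.
for d in interpreter.get_tensor_details():
    zps = d["quantization_parameters"]["zero_points"]
    if d["dtype"] == np.int8 and zps.size and np.any(zps != 0):
        print(d["name"], zps[:4])
~~~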
-
Hi~ Great work there!
What I want to ask is whether RepGhost suffers a serious accuracy loss after INT8 quantization?
Or how do you solve quantization problems? Thanks~
-
Nils, is it possible to create an integer-only model so this could run on accelerators or frameworks such as ArmNN?
https://www.tensorflow.org/lite/performance/post_training_quantization#full_integer…
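Following the linked guide, full-integer quantization would look roughly like this (a sketch: `representative_data_gen` must yield real calibration samples, and the SavedModel path is a placeholder):

~~~python
import tensorflow as tf

def representative_data_gen():
    for sample in calibration_samples:   # hypothetical calibration set
        yield [sample]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")  # placeholder
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force integer-only kernels so the model runs on INT8-only backends.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
open("model_int8.tflite", "wb").write(tflite_model)
~~~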
-
If I wanted to use Quantization Aware Training (QAT) in conjunction with structured hashing, should I quantize **before** or **after** FeatherMap?
i.e. (quantizing before intuitively seems correct to me):
…
-
I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16GB of memory.
I have applied INT8 weight-only quantization, so the size of the engine I…
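As a rough sanity check on whether 2x16GB is enough, a back-of-envelope memory estimate (assuming Baichuan2-7B's published shape of ~7B parameters, 32 layers, hidden size 4096, tensor parallelism of 2; all numbers approximate):

~~~python
GIB = 2**30
params, n_layers, hidden, tp = 7e9, 32, 4096, 2

weights_per_gpu = params * 1 / tp              # INT8: 1 byte per weight
kv_per_token = 2 * n_layers * hidden * 2 / tp  # K+V, FP16, split over TP

print(f"weights/GPU: {weights_per_gpu / GIB:.1f} GiB")
print(f"KV cache/GPU @ 4096 tokens: {kv_per_token * 4096 / GIB:.1f} GiB")
# ~3.3 GiB of weights plus ~1 GiB of KV cache per GPU, before
# activations and runtime buffers, so 16GB should fit unless the
# batch size or context length is large.
~~~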