SolidRusT / srt-model-quantizing

Collection of scripts for quantizing data models
MIT License

Quantizing MAmmoTH2-8B-Plus #2

Closed vackosar closed 1 month ago

vackosar commented 5 months ago

Hello, I tried quantizing this interesting model, but the script is not producing any output. It seems to be crashing:

!git clone https://github.com/SolidRusT/srt-model-quantizing
!pip install -r srt-model-quantizing/awq/requirements.txt
!python srt-model-quantizing/awq/run-quant-awq.py --model_path TIGER-Lab/MAmmoTH2-8B-Plus --quant_path ./MAmmoTH2-8B-Plus-AWQ --zero_point True --q_group_size 128 --w_bit 4 --version GEMM
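For reference, those flags look like the standard AutoAWQ quant_config, so I assume the script does roughly the following under the hood (a sketch only, not verified against run-quant-awq.py):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TIGER-Lab/MAmmoTH2-8B-Plus"
quant_path = "./MAmmoTH2-8B-Plus-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and tokenizer, run AWQ calibration, then save the quantized copy.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)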

Output:

2024-05-08 04:28:41.023895: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-08 04:28:41.023964: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-08 04:28:41.140530: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-08 04:28:43.140468: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Fetching 12 files: 100% 12/12 [00:00<00:00, 67468.70it/s]
Loading checkpoint shards:  50% 2/4 [00:40<00:40, 20.42s/it]^C
vackosar commented 5 months ago

It seems it was a low-RAM problem. But now it fails in Colab due to a tokenizer issue:

ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
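Judging by the last line of the error, installing sentencepiece before loading the tokenizer may be enough to get past this:

!pip install sentencepiece
# then retry; transformers should now be able to convert the slow
# SentencePiece tokenizer into a fast one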
suparious commented 4 months ago

Sorry for the delayed response here. I am currently trying to clean this up. It seems the memory limits are not currently working in AWQ, and I'm working with CasperHansen's branch here to try and sort it out, but it is a bit complicated for me.

I meant to put a tag on commit 7eddd98e9b4e9aea9deb4404dd88e8fd094ad737, as that is where I started to move most of the Bash logic into Python.

I am also trying to enable automatic conversion of PyTorch checkpoints to safetensors in a folder called /version2, where I think we can have better control over memory management.
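For anyone following along, the conversion itself is straightforward with the safetensors library; a minimal sketch (the file names are illustrative):

import torch
from safetensors.torch import save_file

# Load the original PyTorch checkpoint onto CPU.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
# safetensors rejects shared or non-contiguous storage, so copy each tensor.
state_dict = {name: tensor.contiguous().clone() for name, tensor in state_dict.items()}
save_file(state_dict, "model.safetensors")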

suparious commented 4 months ago

Leaving this issue open until we can find an appropriate workflow or solution. I have similar issues on a 24 GB A10G GPU with Gemma models and any model over 22B.
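In the meantime, the kind of memory capping I'd like version2 to handle can be approximated at load time with the standard transformers/accelerate kwargs (the values below are illustrative for a 24 GB A10G):

from transformers import AutoModelForCausalLM

# Keep roughly 20 GiB of weights on GPU 0 and offload the rest to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "TIGER-Lab/MAmmoTH2-8B-Plus",
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "48GiB"},
)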

suparious commented 1 month ago

The quantized model is now available: https://huggingface.co/solidrust/MAmmoTH2-8B-Plus-AWQ