AutoGPTQ / AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

How to force pip install to build the CUDA extension? #128

Open · TheBloke opened this issue 1 year ago

TheBloke commented 1 year ago

Hi

I'm having a lot of problems getting AutoGPTQ to compile when using Docker.

I've tried:

RUN pip install auto-gptq==0.2.0

and

RUN /bin/bash -o pipefail -c 'cd /root && \
git clone https://github.com/PanQiWei/AutoGPTQ && \
cd AutoGPTQ && \
git checkout v0.2.0 && \
PATH=/usr/local/cuda/bin:"$PATH" TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" pip install .'

The second approach worked before, but it doesn't work now, and I can't understand why.

The Docker template in question has CUDA 11.6 installed:

root@a62a92b90c6a:/# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

If I boot into the Docker and compile from the command line, it works fine:

root@a62a92b90c6a:~# git clone https://github.com/PanQiWei/AutoGPTQ
Cloning into 'AutoGPTQ'...
remote: Enumerating objects: 2128, done.
remote: Counting objects: 100% (466/466), done.
remote: Compressing objects: 100% (262/262), done.
remote: Total 2128 (delta 291), reused 243 (delta 193), pack-reused 1662
Receiving objects: 100% (2128/2128), 7.41 MiB | 18.15 MiB/s, done.
Resolving deltas: 100% (1415/1415), done.
root@a62a92b90c6a:~# cd AutoGPTQ/
root@a62a92b90c6a:~/AutoGPTQ# git checkout v0.2.0
Note: switching to 'v0.2.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 6a37f7c update setup.py
root@a62a92b90c6a:~/AutoGPTQ# pip install .
Processing /root/AutoGPTQ
  Preparing metadata (setup.py) ... done
Requirement already satisfied: accelerate>=0.19.0 in /usr/local/lib/python3.10/dist-packages (from auto-gptq==0.2.0+cu1162) (0.20.0.dev0)
Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (from auto-gptq==0.2.0+cu1162) (2.12.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from auto-gptq==0.2.0+cu1162) (1.24.2)
Requirement already satisfied: rouge in /usr/local/lib/python3.10/dist-packages (from auto-gptq==0.2.0+cu1162) (1.0.1)
Requirement already satisfied: torch>=1.13.0 in /usr/local/lib/python3.10/dist-packages (from auto-gptq==0.2.0+cu1162) (2.0.0)
Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from auto-gptq==0.2.0+cu1162) (0.3.1)
Requirement already satisfied: transformers>=4.26.1 in /usr/local/lib/python3.10/dist-packages (from auto-gptq==0.2.0+cu1162) (4.30.0.dev0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.19.0->auto-gptq==0.2.0+cu1162) (23.0)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.19.0->auto-gptq==0.2.0+cu1162) (5.9.4)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.19.0->auto-gptq==0.2.0+cu1162) (6.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (3.10.7)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (4.5.0)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (1.11.1)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (3.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (3.1.2)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (11.7.99)
Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (11.7.99)
Requirement already satisfied: nvidia-cuda-cupti-cu11==11.7.101 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (11.7.101)
Requirement already satisfied: nvidia-cudnn-cu11==8.5.0.96 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (8.5.0.96)
Requirement already satisfied: nvidia-cublas-cu11==11.10.3.66 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (11.10.3.66)
Requirement already satisfied: nvidia-cufft-cu11==10.9.0.58 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (10.9.0.58)
Requirement already satisfied: nvidia-curand-cu11==10.2.10.91 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (10.2.10.91)
Requirement already satisfied: nvidia-cusolver-cu11==11.4.0.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (11.4.0.1)
Requirement already satisfied: nvidia-cusparse-cu11==11.7.4.91 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (11.7.4.91)
Requirement already satisfied: nvidia-nccl-cu11==2.14.3 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (2.14.3)
Requirement already satisfied: nvidia-nvtx-cu11==11.7.91 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (11.7.91)
Requirement already satisfied: triton==2.0.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq==0.2.0+cu1162) (2.0.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.13.0->auto-gptq==0.2.0+cu1162) (67.6.1)
Requirement already satisfied: wheel in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.13.0->auto-gptq==0.2.0+cu1162) (0.40.0)
Requirement already satisfied: cmake in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.13.0->auto-gptq==0.2.0+cu1162) (3.26.1)
Requirement already satisfied: lit in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.13.0->auto-gptq==0.2.0+cu1162) (16.0.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq==0.2.0+cu1162) (0.14.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq==0.2.0+cu1162) (2023.5.4)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq==0.2.0+cu1162) (2.28.2)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq==0.2.0+cu1162) (0.13.3)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq==0.2.0+cu1162) (4.65.0)
Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq==0.2.0+cu1162) (12.0.0)
Requirement already satisfied: dill<0.3.7,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq==0.2.0+cu1162) (0.3.6)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq==0.2.0+cu1162) (2.0.1)
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq==0.2.0+cu1162) (3.2.0)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq==0.2.0+cu1162) (0.70.14)
Requirement already satisfied: fsspec[http]>=2021.11.1 in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq==0.2.0+cu1162) (2023.4.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq==0.2.0+cu1162) (3.8.4)
Requirement already satisfied: responses<0.19 in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq==0.2.0+cu1162) (0.18.0)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from rouge->auto-gptq==0.2.0+cu1162) (1.14.0)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq==0.2.0+cu1162) (22.2.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq==0.2.0+cu1162) (3.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq==0.2.0+cu1162) (6.0.4)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq==0.2.0+cu1162) (4.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq==0.2.0+cu1162) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq==0.2.0+cu1162) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq==0.2.0+cu1162) (1.3.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3/dist-packages (from requests->transformers>=4.26.1->auto-gptq==0.2.0+cu1162) (2.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.26.1->auto-gptq==0.2.0+cu1162) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests->transformers>=4.26.1->auto-gptq==0.2.0+cu1162) (2019.11.28)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.13.0->auto-gptq==0.2.0+cu1162) (2.1.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->auto-gptq==0.2.0+cu1162) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->auto-gptq==0.2.0+cu1162) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->auto-gptq==0.2.0+cu1162) (2023.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.13.0->auto-gptq==0.2.0+cu1162) (1.3.0)
Building wheels for collected packages: auto-gptq
  Building wheel for auto-gptq (setup.py) ... done
  Created wheel for auto-gptq: filename=auto_gptq-0.2.0+cu1162-cp310-cp310-linux_x86_64.whl size=3637006 sha256=84f5263e347cc5199923597b654f994ea35f1f0ea586ae81f5be94984c892b3f
  Stored in directory: /tmp/pip-ephem-wheel-cache-q1oqlde6/wheels/24/88/75/0af9bf8f82c28467ed0e61e1ded8572458d43b390028b42ccb
Successfully built auto-gptq
Installing collected packages: auto-gptq
  Attempting uninstall: auto-gptq
    Found existing installation: auto-gptq 0.2.0+cu1162
    Uninstalling auto-gptq-0.2.0+cu1162:
      Successfully uninstalled auto-gptq-0.2.0+cu1162
Successfully installed auto-gptq-0.2.0+cu1162
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
root@a62a92b90c6a:~/AutoGPTQ#

In general, I've found AutoGPTQ to be very particular about whether or not it will build the CUDA kernel.

Is there some command I can use to force it to build? It would be really helpful.

Thanks very much

TheBloke commented 1 year ago

I also tried adding this:

RUN /bin/bash -o pipefail -c 'cd /root/AutoGPTQ && PATH=/usr/local/cuda/bin:"$PATH" TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" BUILD_CUDA_EXT=1 python setup.py install'

But it's still not building the kernel:

logs:
WARNING:CUDA extension not installed.
WARNING:The safetensors archive passed at models/TheBloke_Samantha-7B-GPTQ/Samantha-7B-GPTQ-4bit-128g.no-act-order.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.

no kernel:

root@5a593abbb4cd:/# find /usr/local/lib/python3.10 -name "*autogptq_cuda*"
root@5a593abbb4cd:/#
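
(An equivalent check from Python, in case the .so landed somewhere outside that prefix; an ImportError here means the extension was never built or installed:)

python -c 'import autogptq_cuda'
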
TheBloke commented 1 year ago

I worked around it by building a wheel while logged into the Docker, then doing this in the Dockerfile:

RUN pip install https://github.com/TheBlokeAI/wheel/raw/master/auto_gptq-0.2.0%2Bcu1162-cp310-cp310-linux_x86_64.whl

But I would love to understand how to do this properly, building from Dockerfile - and what's causing it not to work right now.

PanQiWei commented 1 year ago

I've not tried building wheels in a Docker environment; I might try it this weekend. I guess it's because CUDA can't be detected properly in Docker. Maybe you can try torch.cuda.is_available() first?
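
(For reference, a quick one-liner for that check; this is plain PyTorch, nothing AutoGPTQ-specific:)

python -c 'import torch; print(torch.cuda.is_available(), torch.version.cuda)'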

kumpulak commented 1 year ago

I've not tried building wheels in a Docker environment; I might try it this weekend. I guess it's because CUDA can't be detected properly in Docker. Maybe you can try torch.cuda.is_available() first?

I have the same issue and I have confirmed that command returns True.

nvidia/cuda:11.8.0-devel-ubuntu22.04 is a great base image to test this. I'd assume it's one of the most used base images for ML containers.

flexchar commented 1 year ago

I'm stuck on the same issue. I'm trying to run one of your models, Tom, using the modal.com service. They use Docker images as well.

I'm using this snippet to build the image and keep getting the huge red warning "CUDA Extension not installed".


IMAGE_MODEL_DIR = "/model"
MODEL_BASE_FILE = "Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order"

def download_model():
    from huggingface_hub import snapshot_download
    MODEL_NAME = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ"
    snapshot_download(MODEL_NAME, local_dir=IMAGE_MODEL_DIR)

stub.image = (
    Image.from_dockerhub(
        "nvidia/cuda:11.7.0-devel-ubuntu20.04",
        setup_dockerfile_commands=[
            "RUN apt-get update",
            "RUN apt-get install -y python3 python3-pip python-is-python3",

        ],
    )
    .apt_install("git", "gcc", "build-essential")
    .pip_install(
        "huggingface_hub",
        "transformers",
        "torch",
        "auto-gptq",
        "einops",
    )
    .run_function(download_model)
)

I find myself even more confused that the README mentions we may want to disable the CUDA extensions. Getting them to build in the first place seems to be the challenge, so I wonder why anyone would want to disable them...
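
(For reference, the switch the README is talking about appears to be the BUILD_CUDA_EXT environment variable, the same one Tom sets to 1 above. Assuming it behaves as in the v0.2.x setup.py, disabling would look like:)

BUILD_CUDA_EXT=0 pip install auto-gptq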

TheBloke commented 1 year ago

Yeah this has been quite a pain. I was stuck with this problem yesterday on a Lambda Labs box.

I eventually solved it by installing miniconda, creating a new environment with torch2 etc, and then building from source in there.

I just can't understand what it's checking for and why it fails.

I did notice something interesting: during the execution of setup.py, sys.path is set to something that doesn't include the normal python site-packages directory:

I hacked setup.py to have this code:

import sys  # needed for the sys.path print below

try:
    print("BEFORE BEFORE BEFORE")
    print(sys.path)
    import torch
    print("AFTER AFTER AFTER")
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False

And confirmed that:

  1. It never reached "AFTER.." because it fails on import torch
  2. sys.path was different from what it should be.

Here's some logs I recorded last night:

On the command line, all is fine:

[pytorch2] tomj@a10:/workspace/git/AutoGPTQ git:(main*) $ python
Python 3.10.11 (main, Apr  5 2023, 14:15:10) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path
['', '/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '/workspace/venv/pytorch2/lib/python3.10/site-packages', '/workspace/venv/pytorch2/lib/python3.10/site-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg']
>>> import torch
>>>

But in setup.py, I could not import torch and sys.path is different:

[pytorch2] tomj@a10:/workspace/git/AutoGPTQ git:(main*) $ pip install -v .
Using pip 23.1.2 from /workspace/venv/pytorch2/lib/python3.10/site-packages/pip (python 3.10)
Processing /workspace/git/AutoGPTQ
  Running command pip subprocess to install build dependencies
  Collecting setuptools>=40.8.0
    Using cached setuptools-67.8.0-py3-none-any.whl (1.1 MB)
  Collecting wheel
    Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
  Installing collected packages: wheel, setuptools
  Successfully installed setuptools-67.8.0 wheel-0.40.0
  Installing build dependencies ... done
  Running command Getting requirements to build wheel
  BEFORE BEFORE BEFORE
  ['/workspace/git/AutoGPTQ', '/workspace/venv/pytorch2/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process', '/tmp/pip-build-env-f0hzsznx/site', '/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '/tmp/pip-build-env-f0hzsznx/overlay/lib/python3.10/site-packages', '/tmp/pip-build-env-f0hzsznx/normal/lib/python3.10/site-packages']
  running egg_info
  writing auto_gptq.egg-info/PKG-INFO
  writing dependency_links to auto_gptq.egg-info/dependency_links.txt
  writing requirements to auto_gptq.egg-info/requires.txt
  writing top-level names to auto_gptq.egg-info/top_level.txt
  reading manifest file 'auto_gptq.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file 'auto_gptq.egg-info/SOURCES.txt'
  Getting requirements to build wheel ... done
  Running command Preparing metadata (pyproject.toml)
  BEFORE BEFORE BEFORE
  ['/workspace/git/AutoGPTQ', '/workspace/venv/pytorch2/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process', '/tmp/pip-build-env-f0hzsznx/site', '/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '/tmp/pip-build-env-f0hzsznx/overlay/lib/python3.10/site-packages', '/tmp/pip-build-env-f0hzsznx/normal/lib/python3.10/site-packages']
  running dist_info

So this is sys.path on the command line:

['',
 '/usr/lib/python310.zip',
 '/usr/lib/python3.10',
 '/usr/lib/python3.10/lib-dynload',
 '/workspace/venv/pytorch2/lib/python3.10/site-packages',
 '/workspace/venv/pytorch2/lib/python3.10/site-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg']

And this is sys.path during setup.py:

['/workspace/git/AutoGPTQ',
 '/workspace/venv/pytorch2/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process',
 '/tmp/pip-build-env-f0hzsznx/site',
 '/usr/lib/python310.zip',
 '/usr/lib/python3.10',
 '/usr/lib/python3.10/lib-dynload',
 '/tmp/pip-build-env-f0hzsznx/overlay/lib/python3.10/site-packages',
 '/tmp/pip-build-env-f0hzsznx/normal/lib/python3.10/site-packages']

It's missing /workspace/venv/pytorch2/lib/python3.10/site-packages from my venv.

Could this be part of the problem? But why? I don't really understand the setup.py process at the moment.
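
(If the culprit is pip's PEP 517 build isolation, which is what creates those temporary /tmp/pip-build-env-... paths, one thing worth trying is telling pip to reuse the current environment, so that import torch resolves during the build. An untested sketch:)

pip install --no-build-isolation .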

In the end I got it working in conda, with a pip install . install from source. But of course my method may not work for you on Modal, Luke :(

@PanQiWei we'd really appreciate it if you could investigate this, as it's causing problems for quite a few people. Let me know if you can't re-create it and I can provide you with an SSH login to a system that demonstrates the problem.

flexchar commented 1 year ago

Thank you @TheBloke. I used your suggestion to clone the source code and pip install ., effectively editing my code to:

IMAGE_MODEL_DIR = "/model"
MODEL_BASE_FILE = "Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order"

def download_model():
    from huggingface_hub import snapshot_download
    MODEL_NAME = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ"
    snapshot_download(MODEL_NAME, local_dir=IMAGE_MODEL_DIR)

stub.image = (
    Image.from_dockerhub(
        "nvidia/cuda:11.7.0-devel-ubuntu20.04",
        setup_dockerfile_commands=[
            "RUN apt-get update",
            "RUN apt-get install -y python3 python3-pip python-is-python3",
        ],
    )
    # it also reaches the same output with debian slim image if we were to swap out the base image
    # Image.debian_slim(python_version="3.10")
    .apt_install("git", "gcc", "build-essential")
    .run_commands(
        "git clone https://github.com/PanQiWei/AutoGPTQ.git",
        "cd AutoGPTQ && pip install -e .",
    )
    .pip_install(
        "huggingface_hub",
        "transformers",
        "torch",
        "einops",
    )
    .run_function(download_model)
)

But now I get AttributeError: module 'autogptq_cuda' has no attribute 'vecquant4matmul_faster_old', which keeps me wondering if it was a false sense of progress and I'm actually missing "CUDA-something" underneath.
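
(A quick way to list which kernels the compiled extension actually exposes, with the symbol name taken from the error above:)

python -c 'import autogptq_cuda; print([s for s in dir(autogptq_cuda) if "vecquant" in s])'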

I definitely hope William can help us out and shed some light on the challenge.

Although, you know what, Tom, I've been looking at this Falcon example from the Modal guys: https://github.com/modal-labs/modal-examples/blob/main/06_gpu_and_ml/falcon_gptq.py

After you explained that you used the latest AutoGPTQ to quantize it, it makes sense why I had "model not found in path" issues when I tried to swap in your previously produced quantizations. But what I don't understand is how this code doesn't get the CUDA extension missing error. Perhaps you can notice something.

flexchar commented 1 year ago

So I wanted to post an update since I managed to get it working last night. I still don't understand the magic at its roots, but there were a few catches.

  1. Running the example here https://github.com/PanQiWei/AutoGPTQ/pull/91#issuecomment-1555227948 on a dedicated GPU server helped me a lot to verify that I can get it working with my desired images and the way I install the package. In fact, it failed to compile the wheel when installing through pip.
  2. Using a CUDA image is not enough, but it seems to be a prerequisite.
  3. I need to git clone and pip install . this repository to get it to compile the CUDA extensions.
  4. When pip installing, it IS VERY IMPORTANT that Docker sees a GPU. It seems that if we build images on a platform without GPUs (I'm using a MacBook; Modal.com uses bare CPUxRAM servers), it will cause weird errors such as the one above with vecquant4matmul_faster_old.

As such I'm now using the following snippet to build the image:

IMAGE_MODEL_DIR = "/model"
MODEL_BASE_FILE = "Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order"

def download_model():
    from huggingface_hub import snapshot_download
    MODEL_NAME = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ"
    snapshot_download(MODEL_NAME, local_dir=IMAGE_MODEL_DIR)

stub.image = (
    Image.from_dockerhub(
        "nvidia/cuda:11.7.0-devel-ubuntu20.04",
        setup_dockerfile_commands=[
            "RUN apt-get update",
            "RUN apt-get install -y python3 python3-pip python-is-python3",
        ],
    )
    .apt_install("git", "gcc", "build-essential")
    .run_commands(
        "git clone https://github.com/PanQiWei/AutoGPTQ.git",
        "cd AutoGPTQ && pip install -e .",
        gpu="A10G",
    )
    .pip_install(
        "huggingface_hub",
        "transformers",
        "torch",
        "einops",
    )
    .run_function(download_model)
)

Notice the gpu parameter I pass when running the pip command.

Down the road I will need to build images for other services, so I will need to figure out how to fake or force it to build the right way. That is a huge blank spot in my brain, and a great opportunity to learn something new.

@TheBloke I'd kindly suggest you look into option 3, which helped me out, and into the Docker flag --gpus all, as well as this: https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime.
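
(For anyone going the plain-Docker route rather than Modal: the usual trick from that Stack Overflow link is to make the NVIDIA runtime the default, so the GPU is also visible during docker build. A sketch, assuming nvidia-container-runtime is installed:)

sudo tee /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
  }
}
EOF
sudo systemctl restart docker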

One more thing I'm lost on is why I cannot use any newer CUDA image: trying one, I get screamed at with "pytorch has been compiled with CUDA 11.7.0 version", and I cannot understand at which step exactly that happens...

3dluvr commented 1 year ago

If you pull the 0.2.1 source from GH and try to compile with CUDA, the issue is not Torch but https://github.com/PanQiWei/AutoGPTQ/blob/main/setup.py#L13, which is IN_GITHUB_ACTIONS = os.environ.get("GITHUB_ACTIONS", "false") == "true"

I guess that's set so that CUDA support is automatically compiled for the GH wheel releases, but the rest of us will never get it compiled from source, as we don't use GH Actions locally. :)

Replacing that line with IN_GITHUB_ACTIONS = True will get the CUDA extension compiled, as long as you have CUDA_VERSION=xxx set in your environment, e.g. export CUDA_VERSION=118
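
(Spelled out as commands, assuming the v0.2.1 layout of setup.py; the sed line is the hack described above:)

git clone https://github.com/PanQiWei/AutoGPTQ && cd AutoGPTQ
sed -i 's/^IN_GITHUB_ACTIONS = .*/IN_GITHUB_ACTIONS = True/' setup.py
export CUDA_VERSION=118   # digits of your local CUDA toolkit version, e.g. 11.8 -> 118
pip install .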

Using CUDA 12.1 here in WSL2 and issuing pip install .[triton], I am able to compile the CUDA extension as well as use Triton, and I get ~2.5 t/s on my 3090 with the TheBloke/WizardLM-Uncensored-Falcon-40B model.

Otherwise I can barely scrape 1 t/s after a warm-up; it generally hovers around 0.7 t/s on a default install of AutoGPTQ without CUDA or Triton.

As an aside, pip install auto-gptq fails to compile the CUDA extension here as well, returning an error:

running build_ext
      /home/user/Envs/text-generation-webui_env/lib/python3.10/site-packages/torch/utils/cpp_extension.py:399: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.1
        warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
      building 'autogptq_cuda' extension
      creating /tmp/pip-install-p56cxy8z/auto-gptq_4e922a3ed443469cb66ca8aca66e0719/build/temp.linux-x86_64-cpython-310
      creating /tmp/pip-install-p56cxy8z/auto-gptq_4e922a3ed443469cb66ca8aca66e0719/build/temp.linux-x86_64-cpython-310/autogptq_cuda
      Emitting ninja build file /tmp/pip-install-p56cxy8z/auto-gptq_4e922a3ed443469cb66ca8aca66e0719/build/temp.linux-x86_64-cpython-310/build.ninja...
      Compiling objects...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      ninja: error: '/tmp/pip-install-p56cxy8z/auto-gptq_4e922a3ed443469cb66ca8aca66e0719/autogptq_cuda/autogptq_cuda.cpp', needed by '/tmp/pip-install-p56cxy8z/auto-gptq_4e922a3ed443469cb66ca8aca66e0719/build/temp.linux-x86_64-cpython-310/autogptq_cuda/autogptq_cuda.o', missing and no known rule to make it
      Traceback (most recent call last):
        File "/home/user/Envs/text-generation-webui_env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
          subprocess.run(
        File "/usr/lib/python3.10/subprocess.py", line 524, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

flexchar commented 1 year ago

If you pull the 0.2.1 source from GH and try to compile with CUDA, the issue is not Torch but https://github.com/PanQiWei/AutoGPTQ/blob/main/setup.py#L13, which is IN_GITHUB_ACTIONS = os.environ.get("GITHUB_ACTIONS", "false") == "true"

Actually I'm confused, since I did exactly that:

.run_commands(
        "git clone https://github.com/PanQiWei/AutoGPTQ.git",
        "cd AutoGPTQ && pip install -e .",
        gpu="A10G",
    )

and it works. Reading that part of the code, it seems to me that it only affects the printed version, but it should still try compiling down the road. I'm confused, however.

I hope @PanQiWei can help us out!

TheBloke commented 1 year ago

If you pull the 0.2.1 source from GH and try to compile with CUDA, the issue is not Torch but https://github.com/PanQiWei/AutoGPTQ/blob/main/setup.py#L13, which is IN_GITHUB_ACTIONS = os.environ.get("GITHUB_ACTIONS", "false") == "true"

I guess that's set so that CUDA support is automatically compiled for the GH wheel releases, but the rest of us will never get it compiled from source, as we don't use GH Actions locally. :)

Replacing that line with IN_GITHUB_ACTIONS = True will get the CUDA extension compiled, as long as you have CUDA_VERSION=xxx set in your environment, e.g. export CUDA_VERSION=118

and it works. Reading that part of the code, it seems to me that it only affects the printed version, but it should still try compiling down the road. I'm confused, however.

Yeah, I'm not sure that's correct, because on some systems I definitely can build the CUDA extension with pip install ., without editing setup.py and without CUDA_VERSION set.

So I think it is somehow Torch-related.

For example, testing on Runpod using their runpod/pytorch:3.10-2.0.1-117-devel Docker image, which has torch 2.0.1+cu117 and CUDA toolkit 11.7:

root@005505d3451a:~# which nvcc
/usr/local/cuda/bin/nvcc

root@005505d3451a:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

root@005505d3451a:~# python
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>

Installing from PyPI fails with a strange error, not one I've seen before:

root@005505d3451a:~# pip install -v --no-cache-dir auto-gptq
Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Collecting auto-gptq
  Downloading auto_gptq-0.2.1.tar.gz (48 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.0/48.0 kB 1.6 MB/s eta 0:00:00
  Running command python setup.py egg_info
  running egg_info
  creating /tmp/pip-pip-egg-info-ppi_pk6y/auto_gptq.egg-info
  writing /tmp/pip-pip-egg-info-ppi_pk6y/auto_gptq.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-pip-egg-info-ppi_pk6y/auto_gptq.egg-info/dependency_links.txt
  writing requirements to /tmp/pip-pip-egg-info-ppi_pk6y/auto_gptq.egg-info/requires.txt
  writing top-level names to /tmp/pip-pip-egg-info-ppi_pk6y/auto_gptq.egg-info/top_level.txt
  writing manifest file '/tmp/pip-pip-egg-info-ppi_pk6y/auto_gptq.egg-info/SOURCES.txt'
  /usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
    warnings.warn(msg.format('we could not find ninja.'))
  reading manifest file '/tmp/pip-pip-egg-info-ppi_pk6y/auto_gptq.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file '/tmp/pip-pip-egg-info-ppi_pk6y/auto_gptq.egg-info/SOURCES.txt'
  Preparing metadata (setup.py) ... done
Requirement already satisfied: accelerate>=0.19.0 in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (0.19.0)
Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (2.12.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (1.24.1)
Requirement already satisfied: rouge in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (1.0.1)
Requirement already satisfied: torch>=1.13.0 in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (2.0.1+cu117)
Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (0.3.1)
Requirement already satisfied: transformers>=4.26.1 in /usr/local/lib/python3.10/dist-packages (from auto-gptq) (4.29.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.19.0->auto-gptq) (23.1)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.19.0->auto-gptq) (5.9.5)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.19.0->auto-gptq) (6.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq) (3.9.0)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq) (4.4.0)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq) (1.11.1)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq) (3.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq) (3.1.2)
Requirement already satisfied: triton==2.0.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.13.0->auto-gptq) (2.0.0)
Requirement already satisfied: cmake in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.13.0->auto-gptq) (3.25.0)
Requirement already satisfied: lit in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.13.0->auto-gptq) (15.0.7)
Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq) (0.15.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq) (2023.6.3)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq) (2.28.1)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq) (0.13.3)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.26.1->auto-gptq) (4.65.0)
Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq) (12.0.0)
Requirement already satisfied: dill<0.3.7,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq) (0.3.6)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq) (2.0.2)
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq) (3.2.0)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq) (0.70.14)
Requirement already satisfied: fsspec[http]>=2021.11.1 in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq) (2023.5.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq) (3.8.4)
Requirement already satisfied: responses<0.19 in /usr/local/lib/python3.10/dist-packages (from datasets->auto-gptq) (0.18.0)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from rouge->auto-gptq) (1.16.0)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq) (23.1.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq) (2.1.1)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq) (6.0.4)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq) (4.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->auto-gptq) (1.3.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.26.1->auto-gptq) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.26.1->auto-gptq) (1.26.13)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.26.1->auto-gptq) (2022.12.7)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.13.0->auto-gptq) (2.1.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->auto-gptq) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->auto-gptq) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->auto-gptq) (2023.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.13.0->auto-gptq) (1.2.1)
Building wheels for collected packages: auto-gptq
  Running command python setup.py bdist_wheel
  running bdist_wheel
  /usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
    warnings.warn(msg.format('we could not find ninja.'))
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-cpython-310
  creating build/lib.linux-x86_64-cpython-310/auto_gptq
  copying auto_gptq/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq
  creating build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/_base.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/language_modeling_task.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/sequence_classification_task.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks
  copying auto_gptq/eval_tasks/text_summarization_task.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks
  creating build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/_base.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/_const.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/_utils.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/auto.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/bloom.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/codegen.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/gpt2.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/gpt_bigcode.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/gpt_neox.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/gptj.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/llama.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/moss.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/opt.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  copying auto_gptq/modeling/rw.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/modeling
  creating build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/_fused_base.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/fused_gptj_attn.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/fused_llama_attn.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/fused_llama_mlp.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/qlinear.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/qlinear_old.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  copying auto_gptq/nn_modules/qlinear_triton.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules
  creating build/lib.linux-x86_64-cpython-310/auto_gptq/quantization
  copying auto_gptq/quantization/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/quantization
  copying auto_gptq/quantization/gptq.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/quantization
  copying auto_gptq/quantization/quantizer.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/quantization
  creating build/lib.linux-x86_64-cpython-310/auto_gptq/utils
  copying auto_gptq/utils/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/utils
  copying auto_gptq/utils/data_utils.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/utils
  copying auto_gptq/utils/import_utils.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/utils
  creating build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/_utils
  copying auto_gptq/eval_tasks/_utils/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/_utils
  copying auto_gptq/eval_tasks/_utils/classification_utils.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/_utils
  copying auto_gptq/eval_tasks/_utils/generation_utils.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/_utils
  creating build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils
  copying auto_gptq/nn_modules/triton_utils/__init__.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils
  copying auto_gptq/nn_modules/triton_utils/custom_autotune.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils
  copying auto_gptq/nn_modules/triton_utils/kernels.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils
  copying auto_gptq/nn_modules/triton_utils/mixin.py -> build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils
  running build_ext
  building 'autogptq_cuda' extension
  creating build/temp.linux-x86_64-cpython-310
  creating build/temp.linux-x86_64-cpython-310/autogptq_cuda
  x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -Iautogptq_cuda -I/usr/include/python3.10 -c autogptq_cuda/autogptq_cuda.cpp -o build/temp.linux-x86_64-cpython-310/autogptq_cuda/autogptq_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=autogptq_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  cc1plus: fatal error: autogptq_cuda/autogptq_cuda.cpp: No such file or directory
  compilation terminated.
  error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /usr/bin/python -u -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize

  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)

  __file__ = %r
  sys.argv[0] = __file__

  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"

  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/tmp/pip-install-cz_mho_b/auto-gptq_76eb05edd3f1497d8ba64859a8374a37/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' bdist_wheel -d /tmp/pip-wheel-i5uzbege
  cwd: /tmp/pip-install-cz_mho_b/auto-gptq_76eb05edd3f1497d8ba64859a8374a37/
  Building wheel for auto-gptq (setup.py) ... error
  ERROR: Failed building wheel for auto-gptq
  Running setup.py clean for auto-gptq
  Running command python setup.py clean
  running clean
  removing 'build/temp.linux-x86_64-cpython-310' (and everything under it)
  removing 'build/lib.linux-x86_64-cpython-310' (and everything under it)
  'build/bdist.linux-x86_64' does not exist -- can't clean it
  'build/scripts-3.10' does not exist -- can't clean it
  removing 'build'
Failed to build auto-gptq
ERROR: Could not build wheels for auto-gptq, which is required to install pyproject.toml-based projects
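
(The cc1plus error above says autogptq_cuda/autogptq_cuda.cpp simply isn't there in the unpacked sdist, which would mean the PyPI sdist doesn't ship the CUDA sources. A way to check what the sdist actually contains, as a sketch:)

pip download --no-deps --no-binary :all: auto-gptq==0.2.1 -d /tmp/agq-sdist
tar -tzf /tmp/agq-sdist/auto_gptq-0.2.1.tar.gz | grep autogptq_cuda   # no output would mean the CUDA sources never shipped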

Installing from source works great:

root@005505d3451a:~# git clone https://github.com/PanQiWei/AutoGPTQ
Cloning into 'AutoGPTQ'...
remote: Enumerating objects: 2159, done.
remote: Counting objects: 100% (490/490), done.
remote: Compressing objects: 100% (239/239), done.
remote: Total 2159 (delta 312), reused 316 (delta 240), pack-reused 1669
Receiving objects: 100% (2159/2159), 7.41 MiB | 3.18 MiB/s, done.
Resolving deltas: 100% (1442/1442), done.
root@005505d3451a:~# cd AutoGPTQ/
root@005505d3451a:~/AutoGPTQ# pip install -v .
Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Processing /root/AutoGPTQ
  Running command python setup.py egg_info
  running egg_info
  creating /tmp/pip-pip-egg-info-9232g2ro/auto_gptq.egg-info
  writing /tmp/pip-pip-egg-info-9232g2ro/auto_gptq.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-pip-egg-info-9232g2ro/auto_gptq.egg-info/dependency_links.txt
  writing requirements to /tmp/pip-pip-egg-info-9232g2ro/auto_gptq.egg-info/requires.txt
  writing top-level names to /tmp/pip-pip-egg-info-9232g2ro/auto_gptq.egg-info/top_level.txt
  writing manifest file '/tmp/pip-pip-egg-info-9232g2ro/auto_gptq.egg-info/SOURCES.txt'
  /usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.

....

  running install
  running install_lib
  creating build/bdist.linux-x86_64
  creating build/bdist.linux-x86_64/wheel
  creating build/bdist.linux-x86_64/wheel/auto_gptq
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/__init__.py -> build/bdist.linux-x86_64/wheel/auto_gptq
  creating build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/__init__.py -> build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/_base.py -> build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/language_modeling_task.py -> build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/sequence_classification_task.py -> build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/text_summarization_task.py -> build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks
  creating build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks/_utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/_utils/__init__.py -> build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks/_utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/_utils/classification_utils.py -> build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks/_utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/eval_tasks/_utils/generation_utils.py -> build/bdist.linux-x86_64/wheel/auto_gptq/eval_tasks/_utils
  creating build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/__init__.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/_base.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/_const.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/_utils.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/auto.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/bloom.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/codegen.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/gpt2.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/gpt_bigcode.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/gpt_neox.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/gptj.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/llama.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/moss.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/opt.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/modeling/rw.py -> build/bdist.linux-x86_64/wheel/auto_gptq/modeling
  creating build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/__init__.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/_fused_base.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/fused_gptj_attn.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/fused_llama_attn.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/fused_llama_mlp.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/qlinear.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/qlinear_old.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/qlinear_triton.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules
  creating build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules/triton_utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils/__init__.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules/triton_utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils/custom_autotune.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules/triton_utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils/kernels.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules/triton_utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/nn_modules/triton_utils/mixin.py -> build/bdist.linux-x86_64/wheel/auto_gptq/nn_modules/triton_utils
  creating build/bdist.linux-x86_64/wheel/auto_gptq/quantization
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/quantization/__init__.py -> build/bdist.linux-x86_64/wheel/auto_gptq/quantization
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/quantization/gptq.py -> build/bdist.linux-x86_64/wheel/auto_gptq/quantization
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/quantization/quantizer.py -> build/bdist.linux-x86_64/wheel/auto_gptq/quantization
  creating build/bdist.linux-x86_64/wheel/auto_gptq/utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/utils/__init__.py -> build/bdist.linux-x86_64/wheel/auto_gptq/utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/utils/data_utils.py -> build/bdist.linux-x86_64/wheel/auto_gptq/utils
  copying build/lib.linux-x86_64-cpython-310/auto_gptq/utils/import_utils.py -> build/bdist.linux-x86_64/wheel/auto_gptq/utils
  copying build/lib.linux-x86_64-cpython-310/autogptq_cuda.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/wheel
  running install_egg_info
  running egg_info
  creating auto_gptq.egg-info
  writing auto_gptq.egg-info/PKG-INFO
  writing dependency_links to auto_gptq.egg-info/dependency_links.txt
  writing requirements to auto_gptq.egg-info/requires.txt
  writing top-level names to auto_gptq.egg-info/top_level.txt
  writing manifest file 'auto_gptq.egg-info/SOURCES.txt'
  reading manifest file 'auto_gptq.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file 'auto_gptq.egg-info/SOURCES.txt'
  Copying auto_gptq.egg-info to build/bdist.linux-x86_64/wheel/auto_gptq-0.2.1-py3.10.egg-info
  running install_scripts
  creating build/bdist.linux-x86_64/wheel/auto_gptq-0.2.1.dist-info/WHEEL
  creating '/tmp/pip-wheel-eq_lyc6t/auto_gptq-0.2.1-cp310-cp310-linux_x86_64.whl' and adding 'build/bdist.linux-x86_64/wheel' to it
  adding 'autogptq_cuda.cpython-310-x86_64-linux-gnu.so'
  adding 'auto_gptq/__init__.py'
  adding 'auto_gptq/eval_tasks/__init__.py'
  adding 'auto_gptq/eval_tasks/_base.py'
  adding 'auto_gptq/eval_tasks/language_modeling_task.py'
  adding 'auto_gptq/eval_tasks/sequence_classification_task.py'
  adding 'auto_gptq/eval_tasks/text_summarization_task.py'
  adding 'auto_gptq/eval_tasks/_utils/__init__.py'
  adding 'auto_gptq/eval_tasks/_utils/classification_utils.py'
  adding 'auto_gptq/eval_tasks/_utils/generation_utils.py'
  adding 'auto_gptq/modeling/__init__.py'
  adding 'auto_gptq/modeling/_base.py'
  adding 'auto_gptq/modeling/_const.py'
  adding 'auto_gptq/modeling/_utils.py'
  adding 'auto_gptq/modeling/auto.py'
  adding 'auto_gptq/modeling/bloom.py'
  adding 'auto_gptq/modeling/codegen.py'
  adding 'auto_gptq/modeling/gpt2.py'
  adding 'auto_gptq/modeling/gpt_bigcode.py'
  adding 'auto_gptq/modeling/gpt_neox.py'
  adding 'auto_gptq/modeling/gptj.py'
  adding 'auto_gptq/modeling/llama.py'
  adding 'auto_gptq/modeling/moss.py'
  adding 'auto_gptq/modeling/opt.py'
  adding 'auto_gptq/modeling/rw.py'
  adding 'auto_gptq/nn_modules/__init__.py'
  adding 'auto_gptq/nn_modules/_fused_base.py'
  adding 'auto_gptq/nn_modules/fused_gptj_attn.py'
  adding 'auto_gptq/nn_modules/fused_llama_attn.py'
  adding 'auto_gptq/nn_modules/fused_llama_mlp.py'
  adding 'auto_gptq/nn_modules/qlinear.py'
  adding 'auto_gptq/nn_modules/qlinear_old.py'
  adding 'auto_gptq/nn_modules/qlinear_triton.py'
  adding 'auto_gptq/nn_modules/triton_utils/__init__.py'
  adding 'auto_gptq/nn_modules/triton_utils/custom_autotune.py'
  adding 'auto_gptq/nn_modules/triton_utils/kernels.py'
  adding 'auto_gptq/nn_modules/triton_utils/mixin.py'
  adding 'auto_gptq/quantization/__init__.py'
  adding 'auto_gptq/quantization/gptq.py'
  adding 'auto_gptq/quantization/quantizer.py'
  adding 'auto_gptq/utils/__init__.py'
  adding 'auto_gptq/utils/data_utils.py'
  adding 'auto_gptq/utils/import_utils.py'
  adding 'auto_gptq-0.2.1.dist-info/LICENSE'
  adding 'auto_gptq-0.2.1.dist-info/METADATA'
  adding 'auto_gptq-0.2.1.dist-info/WHEEL'
  adding 'auto_gptq-0.2.1.dist-info/top_level.txt'
  adding 'auto_gptq-0.2.1.dist-info/RECORD'
  removing build/bdist.linux-x86_64/wheel
  Building wheel for auto-gptq (setup.py) ... done
  Created wheel for auto-gptq: filename=auto_gptq-0.2.1-cp310-cp310-linux_x86_64.whl size=2838539 sha256=f65da8f09d0f2b534c82f5e60f3a2fc00c746d4e25dfc59b5d8a2dae60b05c27
  Stored in directory: /tmp/pip-ephem-wheel-cache-pnoayywt/wheels/24/88/75/0af9bf8f82c28467ed0e61e1ded8572458d43b390028b42ccb
Successfully built auto-gptq
Installing collected packages: tokenizers, safetensors, pytz, xxhash, tzdata, tqdm, rouge, regex, pyarrow, multidict, fsspec, frozenlist, dill, async-timeout, yarl, responses, pandas, multiprocess, huggingface-hub, aiosignal, transformers, aiohttp, datasets, accelerate, auto-gptq
  changing mode of /usr/local/bin/tqdm to 755
  changing mode of /usr/local/bin/rouge to 755
  changing mode of /usr/local/bin/huggingface-cli to 755
  changing mode of /usr/local/bin/transformers-cli to 755
  changing mode of /usr/local/bin/datasets-cli to 755
  changing mode of /usr/local/bin/accelerate to 755
  changing mode of /usr/local/bin/accelerate-config to 755
  changing mode of /usr/local/bin/accelerate-launch to 755
Successfully installed accelerate-0.19.0 aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 auto-gptq-0.2.1 datasets-2.12.0 dill-0.3.6 frozenlist-1.3.3 fsspec-2023.5.0 huggingface-hub-0.15.1 multidict-6.0.4 multiprocess-0.70.14 pandas-2.0.2 pyarrow-12.0.0 pytz-2023.3 regex-2023.6.3 responses-0.18.0 rouge-1.0.1 safetensors-0.3.1 tokenizers-0.13.3 tqdm-4.65.0 transformers-4.29.2 tzdata-2023.3 xxhash-3.2.0 yarl-1.9.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

And now it works fine:

root@005505d3451a:~# python -c 'import torch ; import autogptq_cuda'
root@005505d3451a:~#

So it feels like there are several distinct failure modes at the moment:

  1. pip install auto-gptq won't try to build the extension
  2. pip install auto-gptq tries to build the extension but fails, as in the example above
  3. pip install . won't try to build the extension
3dluvr commented 1 year ago

@TheBloke it appears from your output that it does compile the CUDA extension. What does not work is the versioning of the compiled wheel.

Once it is installed and you run pip list, you should see 0.2.1+cuXXX for the version, not just 0.2.1. Because the module can still be compiled without CUDA, the bare version string is ambiguous.
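
You can check the installed version string directly, for example (a quick check using only the standard library):

from importlib.metadata import version

print(version("auto-gptq"))  # e.g. "0.2.1" for a plain build vs "0.2.1+cu1162" for a CUDA-tagged one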

chonpsk commented 1 year ago

Don't use pip. Install it from source with:

python setup.py install

and everything will be fine.

3dluvr commented 1 year ago

This still needs fixing: https://github.com/PanQiWei/AutoGPTQ/blob/main/setup.py#L23

It will never mark the module as VER+cuXXX unless it's compiled through a GitHub Action, so local users will conclude from the version string that there is no CUDA support.
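
Judging from the behaviour described in this thread, the version tagging amounts to something like the following sketch (a reconstruction, not the exact source):

import os

base_version = "0.2.1"

# CUDA_VERSION is set in the GitHub Actions build environment; a local
# build typically doesn't define it, so the +cuXXX tag never gets added.
cuda_version = "".join(os.environ.get("CUDA_VERSION", "").split("."))

version = f"{base_version}+cu{cuda_version}" if cuda_version else base_version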

marella commented 1 year ago

I'm guessing this is related to https://github.com/PanQiWei/AutoGPTQ/issues/115#issuecomment-1581121864 where the autogptq_cuda directory isn't being uploaded to PyPI.

PanQiWei commented 1 year ago

There is some problem in v0.2.1; I will look into it and release a new patch this weekend.

TheBloke commented 1 year ago

Thank you @PanQiWei !

v0.2.2 is working a lot better. On Lambda Labs with CUDA 11.8, pip install auto-gptq worked immediately.

I will test in Docker later today.

yhyu13 commented 1 year ago

Torch nightly now supports CUDA 12.1; I am testing it out with AutoGPTQ 0.3:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

yhyu13 commented 1 year ago

Torch nightly with CUDA 12.1 and AutoGPTQ 0.3 worked well!

flexchar commented 1 year ago

When running pip install auto-gptq==0.2.2 during a Docker image build, I still keep getting CUDA extension not installed.

yhyu13 commented 1 year ago

Did you export the BUILD_CUDA_EXT environment variable when running pip install?

TheBloke commented 1 year ago

So I finally got it all working in Docker. Like @3dluvr said, it all depends on GITHUB_ACTIONS.

I found two options that work in Docker:

From source

ARG AUTOGPTQ="0.2.1"
# Install AutoGPTQ from source
RUN pip3 uninstall -qy auto-gptq && \
    git clone https://github.com/PanQiWei/AutoGPTQ && \
    cd AutoGPTQ && \
    git checkout v$AUTOGPTQ && \
    GITHUB_ACTIONS=true PATH=/usr/local/cuda/bin:"$PATH" TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX;8.9;9.0" pip3 install .

From PyPI

ARG AUTOGPTQ="0.2.2"
RUN pip3 uninstall -y auto-gptq && \
    CUDA_VERSION="" GITHUB_ACTIONS=true PATH=/usr/local/cuda/bin:"$PATH" TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX;8.9;9.0" pip3 install auto-gptq==$AUTOGPTQ --no-cache-dir

As discussed earlier, if using the PyPI version, one has to unset CUDA_VERSION, or else it causes this problem:

#10 4.157 Discarding https://files.pythonhosted.org/packages/94/07/3f3f6905a9bd334c6ee8025df42e4789379612703b935be328caaaa41c23/auto_gptq-0.2.2.tar.gz (from https://pypi.org/simple/auto-gptq/) (requires-python:>=3.8.0): Requested auto-gptq==0.2.2 from https://files.pythonhosted.org/packages/94/07/3f3f6905a9bd334c6ee8025df42e4789379612703b935be328caaaa41c23/auto_gptq-0.2.2.tar.gz has inconsistent version: expected '0.2.2', but metadata has '0.2.2+cu1180'
#10 4.158 ERROR: Could not find a version that satisfies the requirement auto-gptq==0.2.2 (from versions: 0.0.4, 0.0.5, 0.1.0, 0.2.0, 0.2.1, 0.2.2)

And the other issue, again as already discussed, is that setup.py specifically checks whether CUDA is currently available:

if TORCH_AVAILABLE:
    BUILD_CUDA_EXT = int(os.environ.get('BUILD_CUDA_EXT', '1')) == 1

    additional_setup_kwargs = dict()
    if BUILD_CUDA_EXT and (torch.cuda.is_available() or IN_GITHUB_ACTIONS):
        from torch.utils import cpp_extension
        from distutils.sysconfig import get_python_lib
        conda_cuda_include_dir=os.path.join(get_python_lib(),"nvidia/cuda_runtime/include")

We can override this with GITHUB_ACTIONS=true, but that is not at all obvious unless you read the code.

In my opinion, a simpler and more intuitive solution would be to use BUILD_CUDA_EXT. So:

  1. If BUILD_CUDA_EXT=1, the extension is always built. No other checks.
  2. If BUILD_CUDA_EXT=0, the extension is never built.
  3. If BUILD_CUDA_EXT is undefined, then it performs the same checks it does now.

It could then also check GITHUB_ACTIONS if that's needed as an additional override, but the user shouldn't be required to set GITHUB_ACTIONS=true to build the extension when using it outside of GitHub Actions.
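
A sketch of what that could look like in setup.py (illustrating the proposal above, not the current behaviour):

import os

import torch

env_value = os.environ.get("BUILD_CUDA_EXT")  # None when the variable is undefined

if env_value == "1":
    BUILD_CUDA_EXT = True   # always build, no other checks
elif env_value == "0":
    BUILD_CUDA_EXT = False  # never build
else:
    # Undefined: fall back to the current auto-detection behaviour.
    IN_GITHUB_ACTIONS = os.environ.get("GITHUB_ACTIONS", "false") == "true"
    BUILD_CUDA_EXT = torch.cuda.is_available() or IN_GITHUB_ACTIONS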

3dluvr commented 1 year ago

I think the question is whether AutoGPTQ would ever be used with non-CUDA-capable cards.

If it is CUDA-only, the extension should always be built regardless of any setting, because what is the point of using it without CUDA? You might as well run CPU-only through llama.cpp. :)

Docker would then not care one way or the other, so a minimal check would be whether the CUDA toolkit is installed, failing if it isn't.
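
For instance, such a build-time check could look for the toolkit itself rather than a visible GPU (a sketch; treating nvcc on PATH as the indicator is my assumption):

import shutil

# torch.cuda.is_available() is False inside a GPU-less docker build step,
# so look for the CUDA toolkit (nvcc on PATH) instead and fail loudly.
if shutil.which("nvcc") is None:
    raise RuntimeError("CUDA toolkit not found (nvcc is not on PATH); cannot build the CUDA extension.")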

Unless I'm not seeing other use cases here?

ye7iaserag commented 1 year ago

Great job as always @TheBloke. I added

pip uninstall -y auto-gptq && GITHUB_ACTIONS=true pip install auto-gptq --no-cache-dir

at the top of my entrypoint.sh, and now the extension is built with CUDA; only GITHUB_ACTIONS was needed. I faced similar issues a long time ago: since CUDA is not detectable during a Docker build, the solution was always to either disable the checks or force the build.