Atinoda / text-generation-webui-docker

Docker variants of oobabooga's text-generation-webui, including pre-built images.
GNU Affero General Public License v3.0

Apple Silicon / macOS support #22

Open tnunamak opened 1 year ago

tnunamak commented 1 year ago

For atinoda/text-generation-webui:llama-cpu-nightly:

 ⠹ text-generation-webui Pulling                                                                                                                                                                       1.2s 
no matching manifest for linux/arm64/v8 in the manifest list entries
make: *** [up] Error 18

For reference, atinoda/text-generation-webui:llama-cpu works without error.
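
For anyone hitting the same pull error, one way to check which architectures a tag actually publishes is to inspect its manifest list; an amd64 image can also be forced as a stopgap. This is a sketch using standard Docker CLI commands, and emulated amd64 will be slow on Apple Silicon:

# List the platforms published for this tag
docker manifest inspect atinoda/text-generation-webui:llama-cpu-nightly | grep -A2 '"platform"'

# Stopgap: pull the amd64 image and run it under emulation (slow on Apple Silicon)
docker pull --platform linux/amd64 atinoda/text-generation-webui:llama-cpu-nightly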

Atinoda commented 1 year ago

That's interesting, thanks for reporting - it could be caused by a change in the nightly code. It's good that the older point release works, so it's not permanently broken. The next nightly might work fine, or you could try building it locally to investigate further. Are you running on an Apple ARM processor?

tnunamak commented 1 year ago

Thanks, for now I've fallen back on building locally. I'm running macOS on an Apple M1 Pro.
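
(For reference, a native arm64 build of the CPU variant would look something like the sketch below, assuming the repo's multi-stage Dockerfile with its llama-cpu target; the image tag is illustrative and the command is run from the repository root.)

# Sketch: build the llama-cpu variant natively on the M1; VERSION_TAG can be
# passed with --build-arg if a specific upstream release is wanted
docker build --target llama-cpu -t local/text-generation-webui:llama-cpu .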

Atinoda commented 1 year ago

Glad that it builds correctly on your machine, thank you for confirming! Are you making any changes to the Dockerfile, or does it just work? I would like Apple ARM silicon to be supported, but I do not have a machine to do any building or testing on. Hopefully the next point release will be fine, and it's just the development of GGUF support that's causing some hiccups in the short term.

tnunamak commented 1 year ago

Sorry, wrong button 😅

I simplified the Dockerfile; IIRC there was a library (AutoGPTQ?) that was failing to install:

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS env_base
# Pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
    git vim build-essential python3-dev python3-venv python3-pip
# Instantiate venv and pre-activate
RUN pip3 install virtualenv
RUN virtualenv /venv
# Set venv
ENV VIRTUAL_ENV=/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip3 install --upgrade pip setuptools && \
    pip3 install torch torchvision torchaudio

FROM env_base AS app_base
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
# Clone oobabooga/text-generation-webui
RUN git clone https://github.com/oobabooga/text-generation-webui /src
# Use script to check out specific version
ARG VERSION_TAG
ENV VERSION_TAG=${VERSION_TAG}
RUN . /scripts/checkout_src_version.sh
# Copy source to app
RUN cp -ar /src /app
# Install oobabooga/text-generation-webui
RUN --mount=type=cache,target=/root/.cache/pip pip3 install -r /app/requirements.txt

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS base
# Runtime pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
    python3-venv python3-dev git
# Copy app and src
COPY --from=app_base /app /app
COPY --from=app_base /src /src
# Copy and activate venv
COPY --from=app_base /venv /venv
ENV VIRTUAL_ENV=/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Finalise app setup
WORKDIR /app
EXPOSE 7860
EXPOSE 5000
EXPOSE 5005
# Required for Python print statements to appear in logs
ENV PYTHONUNBUFFERED=1
# Set build and version tags
ARG BUILD_DATE
ENV BUILD_DATE=$BUILD_DATE
RUN echo "$BUILD_DATE" > /build_date.txt
ARG VERSION_TAG
ENV VERSION_TAG=$VERSION_TAG
RUN echo "$VERSION_TAG" > /version_tag.txt
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
# Run
ENTRYPOINT ["/scripts/docker-entrypoint.sh"]

FROM base AS llama-cpu
RUN echo "LLAMA-CPU" >> /variant.txt
RUN apt-get install --no-install-recommends -y git python3-dev build-essential python3-pip
RUN unset TORCH_CUDA_ARCH_LIST LLAMA_CUBLAS
RUN pip uninstall -y llama_cpp_python_cuda llama-cpp-python && pip install llama-cpp-python --force-reinstall --upgrade
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py", "--cpu"]

AIWintermuteAI commented 1 year ago

Coming here from #30 - @Atinoda mentioned this is becoming the go-to Apple M1 issue (worth changing the title?). Using the modified Dockerfile from @tnunamak yields this:

#0 7.064 Collecting bitsandbytes==0.41.1 (from -r /app/requirements.txt (line 26))
#0 7.071   Downloading bitsandbytes-0.41.1-py3-none-any.whl.metadata (9.8 kB)
#0 7.092 ERROR: Ignored the following versions that require a different python version: 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10
#0 7.093 ERROR: Could not find a version that satisfies the requirement autoawq==0.1.4 (from versions: none)
#0 7.093 ERROR: No matching distribution found for autoawq==0.1.4
------
failed to solve: process "/bin/sh -c pip3 install -r /app/requirements.txt" did not complete successfully: exit code: 1

I guess this means the package is not available for arm64. To tell the truth, I think the whole container needs to be rebuilt from scratch for M1, dropping the CUDA dependencies that take up a lot of space and are useless there. But you also mentioned that you do not have an M1 Mac at the moment. I will see if I can spend some time figuring out the right Dockerfile for M1.
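
(If a full rewrite takes a while, one possible stopgap is to filter the packages that ship no arm64 wheels out of the requirements file before the pip install line - a sketch only, since the exact entries in the upstream requirements.txt change between releases:)

# Illustrative only: drop requirements with no arm64 builds (e.g. autoawq, AutoGPTQ)
# before installing; adjust the pattern to whatever actually fails on a given release
RUN sed -i -E '/autoawq|auto[-_]?gptq/Id' /app/requirements.txt && \
    pip3 install -r /app/requirements.txt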

Atinoda commented 1 year ago

Thanks for posting your experiences here @AIWintermuteAI! I agree with you that a full rewrite makes sense for the M1 use-case. Beyond that, I think ROCm/AMD will require the same, and it then also makes sense to do the same for CPU-only inference (which is more popular than I expected).

I will have a think on how to refactor for that scenario, probably kicking off with refactoring the default and llama-cpu variants. The results of that work may be helpful for putting together an Apple Silicon variant as well. It certainly deserves an image - ggerganov/llama.cpp on Mac is one of the things that jump-started the whole open-source LLM movement!

AIWintermuteAI commented 1 year ago

I actually made it work, but inference took ages with llama-2-7b-chat.Q5_K_M.gguf... I'm wondering why that would be the case - perhaps I was previously running a similarly sized model with llama.cpp with Metal acceleration? Not sure, I need to test. Here is the Dockerfile for your reference - it builds and runs fine:

FROM ubuntu:22.04@sha256:02410fbfad7f2842cce3cf7655828424f4f7f6b5105b0016e24f1676f3bd15f5 AS env_base
# Pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
    git vim build-essential python3-dev python3-venv python3-pip
# Instantiate venv and pre-activate
RUN pip3 install virtualenv
RUN virtualenv /venv
# Credit, Itamar Turner-Trauring: https://pythonspeed.com/articles/activate-virtualenv-dockerfile/
ENV VIRTUAL_ENV=/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip3 install --upgrade pip setuptools && \
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

FROM env_base AS app_base
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
### DEVELOPERS/ADVANCED USERS ###
# Clone oobabooga/text-generation-webui
RUN git clone https://github.com/oobabooga/text-generation-webui /src
# Use script to check out specific version
ARG VERSION_TAG
ENV VERSION_TAG=${VERSION_TAG}
RUN . /scripts/checkout_src_version.sh
# To use local source: comment out the git clone command then set the build arg `LCL_SRC_DIR`
#ARG LCL_SRC_DIR="text-generation-webui"
#COPY ${LCL_SRC_DIR} /src
#################################
ENV LLAMA_CUBLAS=1
# Copy source to app
RUN cp -ar /src /app
# Install oobabooga/text-generation-webui
RUN --mount=type=cache,target=/root/.cache/pip pip3 install -r /app/requirements_cpu_only.txt
# Install extensions
RUN --mount=type=cache,target=/root/.cache/pip \
    chmod +x /scripts/build_extensions.sh && . /scripts/build_extensions.sh
# Clone default GPTQ
#RUN git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda /app/repositories/GPTQ-for-LLaMa
# Build and install default GPTQ ('quant_cuda')
#ARG TORCH_CUDA_ARCH_LIST="6.1;7.0;7.5;8.0;8.6+PTX"
#RUN cd /app/repositories/GPTQ-for-LLaMa/ && python3 setup_cuda.py install
# Install flash attention for exllamav2
#RUN pip install flash-attn --no-build-isolation

FROM ubuntu:22.04@sha256:02410fbfad7f2842cce3cf7655828424f4f7f6b5105b0016e24f1676f3bd15f5 AS base
# Runtime pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
    python3-venv python3-dev git
# Copy app and src
COPY --from=app_base /app /app
COPY --from=app_base /src /src
# Copy and activate venv
COPY --from=app_base /venv /venv
ENV VIRTUAL_ENV=/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Finalise app setup
WORKDIR /app
EXPOSE 7860
EXPOSE 5000
EXPOSE 5005
# Required for Python print statements to appear in logs
ENV PYTHONUNBUFFERED=1
# Force variant layers to sync cache by setting --build-arg BUILD_DATE
ARG BUILD_DATE
ENV BUILD_DATE=$BUILD_DATE
RUN echo "$BUILD_DATE" > /build_date.txt
ARG VERSION_TAG
ENV VERSION_TAG=$VERSION_TAG
RUN echo "$VERSION_TAG" > /version_tag.txt
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
# Run
ENTRYPOINT ["/scripts/docker-entrypoint.sh"]

# VARIANT BUILDS
FROM base AS cuda
RUN echo "CUDA" >> /variant.txt
RUN apt-get install --no-install-recommends -y git python3-dev python3-pip
RUN rm -rf /app/repositories/GPTQ-for-LLaMa && \
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda /app/repositories/GPTQ-for-LLaMa
RUN pip3 uninstall -y quant-cuda && \
    sed -i 's/^safetensors==0\.3\.0$/safetensors/g' /app/repositories/GPTQ-for-LLaMa/requirements.txt && \
    pip3 install -r /app/repositories/GPTQ-for-LLaMa/requirements.txt
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py"]

FROM base AS triton
RUN echo "TRITON" >> /variant.txt
RUN apt-get install --no-install-recommends -y git python3-dev build-essential python3-pip
RUN rm -rf /app/repositories/GPTQ-for-LLaMa && \
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b triton /app/repositories/GPTQ-for-LLaMa
RUN pip3 uninstall -y quant-cuda && \
    sed -i 's/^safetensors==0\.3\.0$/safetensors/g' /app/repositories/GPTQ-for-LLaMa/requirements.txt && \
    pip3 install -r /app/repositories/GPTQ-for-LLaMa/requirements.txt
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py"]

FROM base AS monkey-patch
RUN echo "4-BIT MONKEY-PATCH" >> /variant.txt
RUN apt-get install --no-install-recommends -y git python3-dev build-essential python3-pip
RUN git clone https://github.com/johnsmith0031/alpaca_lora_4bit /app/repositories/alpaca_lora_4bit && \
    cd /app/repositories/alpaca_lora_4bit && git checkout 2f704b93c961bf202937b10aac9322b092afdce0
ARG TORCH_CUDA_ARCH_LIST="8.6"
RUN pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py", "--monkey-patch"]

FROM base AS llama-cpu
RUN echo "LLAMA-CPU" >> /variant.txt
RUN apt-get install --no-install-recommends -y git python3-dev build-essential python3-pip libopenblas-dev
RUN unset TORCH_CUDA_ARCH_LIST LLAMA_CUBLAS
RUN pip uninstall -y llama_cpp_python_cuda llama-cpp-python && pip install llama-cpp-python --force-reinstall --upgrade
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py", "--cpu"]

FROM base AS default
RUN echo "DEFAULT" >> /variant.txt
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py"]
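
(If it helps anyone else trying this on Apple Silicon, the file above can be built with an explicit platform so that an x86 image never sneaks in under emulation - a sketch with an illustrative tag name, run from the repository root:)

# Sketch: build the llama-cpu target natively for arm64 and load it into the local daemon
docker buildx build --platform linux/arm64 --target llama-cpu \
  -t local/text-generation-webui:llama-cpu-arm64 --load .
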
rmrfxyz commented 11 months ago

@AIWintermuteAI's Dockerfile worked for me, but it is not possible to load any model. When loading a model, the container dies with:

2023-11-29 04:00:45 INFO:Loading orca-2-13b.Q5_K_M.gguf...
qemu: uncaught target signal 4 (Illegal instruction) - core dumped
/scripts/docker-entrypoint.sh: line 69:   256 Illegal instruction     "${LAUNCHER[@]}"

It runs correctly directly on my M1 host machine, though it is extremely slow to respond, so I'm fairly certain this is a Docker problem. Should I make a new issue? Did anyone manage to run a model in a similar setup?

EDIT: Using image: atinoda/text-generation-webui:llama-cpu-nightly, the error became:

text-generation-webui  | 2023-11-29 07:12:29 INFO:Loading mistral-7b-instruct-v0.1.Q5_K_S.gguf...
text-generation-webui  | /scripts/docker-entrypoint.sh: line 69:   260 Killed                  "${LAUNCHER[@]}"
text-generation-webui exited with code 137

EDIT 2: I got it to (sort of) work using Mistral 7B and the ctransformers loader:

text-generation-webui  | 2023-11-29 07:52:28 INFO:Loading mistral-7b-instruct-v0.1.Q5_K_S.gguf...
text-generation-webui  | 2023-11-29 07:52:29 INFO:ctransformers weights detected: models/mistral-7b-instruct-v0.1.Q5_K_S.gguf
text-generation-webui  | 2023-11-29 07:52:30 INFO:Using ctransformers model_type: llama for /app/models/mistral-7b-instruct-v0.1.Q5_K_S.gguf
text-generation-webui  | 2023-11-29 07:52:30 INFO:TRUNCATION LENGTH: 16896
text-generation-webui  | 2023-11-29 07:52:30 INFO:INSTRUCTION TEMPLATE: Mistral
text-generation-webui  | 2023-11-29 07:52:30 INFO:Loaded the model in 2.21 seconds.
text-generation-webui  | 
text-generation-webui  | 
text-generation-webui  | [INST] Continue the chat dialogue below. Write a single reply for the character "AI".
text-generation-webui  | 
text-generation-webui  | The following is a conversation with an AI Large Language Model. The AI has been trained to answer questions, provide recommendations, and help with decision making. The AI follows user requests. The AI thinks outside the box.
text-generation-webui  | AI: How can I help you today?

So it loads the model alright, but it never answers: it completely soaks up my CPU and never produces any output. I suspect this can be addressed with some tuning and is not an issue for this repo.
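
(One hedged idea for that tuning: the images set an empty EXTRA_LAUNCH_ARGS environment variable, and assuming the entrypoint forwards it to server.py as the Dockerfiles suggest, thread count and context size could be passed in without rebuilding. The flags below are upstream text-generation-webui options; the values and mount path are illustrative.)

# Sketch: pass tuning flags through EXTRA_LAUNCH_ARGS (assumes the entrypoint
# appends these to the server command; thread count and context size are illustrative)
docker run --rm -p 7860:7860 \
  -v "$PWD/models:/app/models" \
  -e EXTRA_LAUNCH_ARGS="--threads 8 --n_ctx 2048" \
  atinoda/text-generation-webui:llama-cpu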

AIWintermuteAI commented 11 months ago

The qemu in your output tells me that the container was built for x86 and not aarch64 - so something went wrong somewhere, as ubuntu:22.04@sha256:02410fbfad7f2842cce3cf7655828424f4f7f6b5105b0016e24f1676f3bd15f5 is an aarch64 image.
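
A quick way to confirm which architecture was actually pulled or built (standard Docker CLI; substitute whichever image/tag was actually run):

# Check the architecture recorded in the image metadata - expect linux/arm64 on Apple Silicon
docker image inspect --format '{{.Os}}/{{.Architecture}}' atinoda/text-generation-webui:llama-cpu-nightly

# Or check from inside the container: aarch64 is native, x86_64 means qemu emulation
docker run --rm --entrypoint uname atinoda/text-generation-webui:llama-cpu-nightly -m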

Atinoda commented 9 months ago

It is highly unlikely that accelerated inference will ever be available via Docker on Apple Silicon. This is due to how Docker is implemented on macOS: it effectively floats on a VM with a weirdly virtualised ARM CPU that does not expose the underlying CPU/GPU capabilities. Asahi Linux may be an option in the future.

However, I have an M1 Mac and I would like to use the hardware with text-generation-webui. Unfortunately, Apple does not provide any appropriate containerisation or virtualisation technologies - so I will be exploring different packaging options. If I have any developments worth sharing, I will post them here.