Atinoda / text-generation-webui-docker

Docker variants of oobabooga's text-generation-webui, including pre-built images.
GNU Affero General Public License v3.0

Container instantly crashes when trying to load GGUF #45

Closed ErrickVDW closed 7 months ago

ErrickVDW commented 8 months ago

I am currently running the container on Unraid. I have used the Docker Compose file as well as manually creating the container and changing the storage mounts. I am able to download models from HF, and when I select the GGUF model from the drop-down it selects the llama.cpp loader. I have tried many different combinations of settings, but nothing works; this is also true of ctransformers. As soon as I click Load, the container crashes with no logs. I am passing in my GTX 1070 with 8 GB of VRAM, and it is visible from within the container by running nvidia-smi. I have tried the DEFAULT and NVIDIA variants, and even snapshots from 2023. I am not sure what I am doing wrong.

[Screenshots of the loader settings and container configuration, 2024-03-12]

Atinoda commented 8 months ago

Hi @ErrickVDW - GGUF is for CPU inference, so you probably don't want that if you are planning on using your GPU. There should definitely be logs - unfortunately, without them or a stack trace there's nothing I can offer in the way of help. However, the cause of the crash should definitely be recorded somewhere - try to find it! I suggest troubleshooting it interactively with the help of ChatGPT / Gemini / Claude / etc. to try and track down the issue. Perhaps somebody else who has run into the same problem may be able to offer some insight!

ErrickVDW commented 8 months ago

Thanks so much for the quick reply! I see - I was under the impression that GGML, and now GGUF, models allow for CPU+GPU inference, letting you partially load the model onto the GPU without it having to be entirely loaded into VRAM.

Unfortunately it looks as though the container crashes before anything is written to logs. I have also used tools such as netdata to hopefully get some insight but can't seem to find anything.

Atinoda commented 8 months ago

Sorry about that - you are indeed correct! I don't do much with hybrid model loading - I usually stick to either CPU or GPU. Another user got the container working on Unraid and you can read about it in #27 - it might be helpful even though it's ROCm. There should be a log from the Docker daemon or whichever service is managing the container itself, even if the container does not produce any logs.
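
For example, something along these lines (a generic sketch - substitute your actual container name or ID) will usually surface the daemon-side record of the crash:

docker ps -a                                    # find the container name / ID
docker logs --tail 200 text-generation-webui    # last output the container wrote before exiting
docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' text-generation-webui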

ErrickVDW commented 8 months ago

Amazing, thanks for the reference! I will dig through the issue you mentioned and see if I can spot anything that may help me. I will also look into the docker logging you mentioned and see if I can find anything from the crashes. Will update as soon as I can!

Thank you again

ErrickVDW commented 8 months ago

Hi @Atinoda, apologies for the late reply. I'm not sure how much help it is but I was able to find an additional error in the docker logs that was not visible in the log terminal:

{"log":"/scripts/docker-entrypoint.sh: line 69: 90 Illegal instruction \"${LAUNCHER[@]}\"\n","stream":"stderr","time":"2024-03-13T17:15:10.659958918Z"}

This was a fresh, default container, and I went straight to loading the model using llama.cpp.

Please let me know if this is of any help. (Attached: container.log)

Atinoda commented 8 months ago

Good job on finding the logs! That does help. Please try removing the quotes around your EXTRA_LAUNCH_ARGS parameter, and instead escape the spaces with a backslash - i.e., --listen\ --verbose\ --api
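
For reference, the compose entry would then look something like this (a minimal sketch of the escaped form - adjust the args to whatever you actually need):

    environment:
      - EXTRA_LAUNCH_ARGS=--listen\ --verbose\ --api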

ErrickVDW commented 8 months ago

Thanks! I completely redid the config to make sure everything was correct. Here it is with the updated args:

[Screenshot of the updated container configuration]

Unfortunately it is still crashing, but I am seeing some new errors, which feels like a good sign! (Attached: container.log)

wififun commented 8 months ago

> Hi @Atinoda, apologies for the late reply. I'm not sure how much help it is but I was able to find an additional error in the docker logs that was not visible in the log terminal:
>
> {"log":"/scripts/docker-entrypoint.sh: line 69: 90 Illegal instruction "${LAUNCHER[@]}"\n","stream":"stderr","time":"2024-03-13T17:15:10.659958918Z"}
>
> Fresh, default container and went straight to loading the model using llama.cpp.
>
> Please let me know if this is of any help container.log

I have exactly the same problem:

text-generation-webui  | 18:28:00-234940 INFO     Loading "llama-2-7b-chat.Q2_K.gguf"
text-generation-webui  | /scripts/docker-entrypoint.sh: line 69:   128 Illegal instruction     (core dumped) "${LAUNCHER[@]}"
text-generation-webui exited with code 132
user@test:~/text-generation-webui-docker$

ErrickVDW commented 8 months ago

I have removed all the folders on the host and let the container recreate them. With freshly downloaded models I am no longer seeing the majority of those errors - just these:

/scripts/docker-entrypoint.sh: line 69:    79 Illegal instruction     "${LAUNCHER[@]}"
/scripts/docker-entrypoint.sh: line 69:    82 Illegal instruction     "${LAUNCHER[@]}"

wififun commented 8 months ago

> Good job on finding the logs! That does help. Please try removing the quotes around your EXTRA_LAUNCH_ARGS parameter, and instead escape the spaces with a backslash - i.e., --listen\ --verbose\ --api

Unfortunately your recommendations didn't help :(

  - EXTRA_LAUNCH_ARGS=--listen\ --verbose # Custom launch args (e.g., --model MODEL_NAME)

Atinoda commented 8 months ago

@wififun - thank you for also reporting your issue. If we can fix it for both of you then hopefully it's a good fix!

The root problem is with how the script parses (or rather, fails to parse) the launch arguments. I have been meaning to revisit that bit of the script because it has caused problems elsewhere... but for now, it should be possible to get it up and running. Are you also including the # Custom launch args (e.g., --model MODEL_NAME) part? That part must not be included for Unraid.

@Steel-skull has posted a working Unraid template in issue #5 - does that help at all?

The other thing to try is to leave EXTRA_LAUNCH_ARGS blank - can the container launch then?
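
For anyone curious why the quoting matters, here is a minimal bash sketch (a simplified illustration only, not the actual entrypoint) of how a launcher array built from that variable splits its arguments:

#!/bin/bash
# Simplified illustration of the word-splitting behaviour.
EXTRA_LAUNCH_ARGS='--listen --verbose'                  # value without surrounding quotes
LAUNCHER=(python3 /app/server.py ${EXTRA_LAUNCH_ARGS})  # unquoted expansion splits on spaces
printf 'arg: %s\n' "${LAUNCHER[@]}"                     # prints --listen and --verbose as separate args
# If the value arrives with literal quote characters baked in (as an Unraid
# template field can do), those quotes are passed through verbatim and the
# server receives arguments like '"--listen' that it cannot parse.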

EDIT: Fixed word salad.

Steel-skull commented 8 months ago

I have a fully updated version of the template that "should" work - it works for me, but everyone's config is different. (I'll post it soon.)

I've noticed it doesn't always pull a version of the CUDA toolkit that matches the Unraid server when loading (causing dependency hell and nothing works), but as long as you keep your Unraid server at driver v545.29.06 it seems to work fine.

Also, if you have a GGML or GPTQ model that uses more VRAM than you have, it will crash the container. EXL2 doesn't seem to have this issue.

Steel-skull commented 8 months ago

@Atinoda

Here is the updated version:

<?xml version="1.0"?>
<Container version="2">
  <Name>text-generation-webui</Name>
  <Repository>atinoda/text-generation-webui:latest</Repository>
  <Registry/>
  <Network>bridge</Network>
  <MyIP/>
  <Shell>sh</Shell>
  <Privileged>false</Privileged>
  <Support/>
  <Project/>
  <Overview/>
  <Category/>
  <WebUI>http://[IP]:[PORT:7860]</WebUI>
  <TemplateURL/>
  <Icon/>
  <ExtraParams>--runtime=nvidia</ExtraParams>
  <PostArgs/>
  <CPUset/>
  <DateInstalled>1710364177</DateInstalled>
  <DonateText/>
  <DonateLink/>
  <Requires/>
  <Config Name="WebUI" Target="7860" Default="7860" Mode="tcp" Description="" Type="Port" Display="always" Required="true" Mask="false">7860</Config>
  <Config Name="Open AI API" Target="5000" Default="5000" Mode="tcp" Description="" Type="Port" Display="always" Required="true" Mask="false">5000</Config>
  <Config Name="Characters" Target="/app/characters" Default="./config/characters" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/characters/</Config>
  <Config Name="Loras" Target="/app/loras" Default="./config/loras" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/loras/</Config>
  <Config Name="Models" Target="/app/models" Default="./config/models" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/models/</Config>
  <Config Name="Presets" Target="/app/presets" Default="./config/presets" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/presets/</Config>
  <Config Name="Prompts" Target="/app/prompts" Default="./config/prompts" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/prompts/</Config>
  <Config Name="Training" Target="/app/training" Default="./config/training" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/training/</Config>
  <Config Name="Extensions" Target="/app/extensions" Default="./config/extensions" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/extensions/</Config>
  <Config Name="EXTRA_LAUNCH_ARGS" Target="EXTRA_LAUNCH_ARGS" Default="" Mode="" Description="" Type="Variable" Display="always" Required="false" Mask="false">"--listen --verbose --api"</Config>
  <Config Name="NVIDIA_VISIBLE_DEVICES" Target="NVIDIA_VISIBLE_DEVICES" Default="all" Mode="" Description="" Type="Variable" Display="always" Required="false" Mask="false">all</Config>
  <Config Name="NVIDIA_DRIVER_CAPABILITIES" Target="NVIDIA_DRIVER_CAPABILITIES" Default="all" Mode="" Description="" Type="Variable" Display="always" Required="false" Mask="false">all</Config>
</Container>

This is the exact version I use on a 2x 3090ti server, so it should work with multiple cards.

Use driver v545.29.06
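
(If you want to confirm which driver the host is actually running, something like this from the Unraid terminal works:)

nvidia-smi --query-gpu=driver_version --format=csv,noheader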

ErrickVDW commented 8 months ago

Hi there @Steel-skull,

I have created a new container using that exact config, only changing the API port to 5005 because mine is currently occupied, and have still run into the same issue.

Downloaded TheBloke/Llama-2-7B-Chat-GGUF, set llama.cpp to 10 layers and it unfortunately crashed immediately again.


ls: cannot access '/app/training/datasets': No such file or directory
cp: cannot create regular file '/app/training/datasets/': Not a directory
chown: cannot access '/app/training/datasets': No such file or directory
ls: cannot access '/app/training/formats': No such file or directory
cp: target '/app/training/formats/' is not a directory
chown: cannot access '/app/training/formats': No such file or directory
100%|██████████| 2.95G /2.95G  15.8MiB/s/s
*** Initialising config for: 'characters' ***/scripts/docker-entrypoint.sh: line 69:   134 Illegal instruction     "${LAUNCHER[@]}"

*** Initialising config for: 'loras' ***
*** Initialising config for: 'models' ***
*** Initialising config for: 'presets' ***
*** Initialising config for: 'prompts' ***
*** Initialising config for: 'training/datasets' ***
*** Initialising config for: 'training/formats' ***
*** Initialising extension: 'Training_PRO' ***
*** Initialising extension: 'character_bias' ***
*** Initialising extension: 'coqui_tts' ***
*** Initialising extension: 'example' ***
*** Initialising extension: 'gallery' ***
*** Initialising extension: 'google_translate' ***
*** Initialising extension: 'long_replies' ***
*** Initialising extension: 'multimodal' ***
*** Initialising extension: 'ngrok' ***
*** Initialising extension: 'openai' ***
*** Initialising extension: 'perplexity_colors' ***
*** Initialising extension: 'sd_api_pictures' ***
*** Initialising extension: 'send_pictures' ***
*** Initialising extension: 'silero_tts' ***
*** Initialising extension: 'superbooga' ***
*** Initialising extension: 'superboogav2' ***
*** Initialising extension: 'whisper_stt' ***
=== Running text-generation-webui variant: 'Nvidia Extended' snapshot-2024-03-10 ===
=== (This version is 18 commits behind origin main) ===
=== Image build date: 2024-03-11 22:19:32 ===
09:18:29-157098 INFO     Starting Text generation web UI                        
09:18:29-162116 WARNING                                                         
                         You are potentially exposing the web UI to the entire  
                         internet without any access password.                  
                         You can create one with the "--gradio-auth" flag like  
                         this:                                                  

                         --gradio-auth username:password                        

                         Make sure to replace username:password with your own.  
09:18:29-174735 INFO     Loading the extension "openai"                         
09:18:29-461495 INFO     OpenAI-compatible API URL:                             

                         http://0.0.0.0:5000                                    

09:18:29-463583 INFO     Loading the extension "gallery"                        

Running on local URL:  http://0.0.0.0:7860

Downloading the model to models
09:26:23-076717 INFO     Loading "llama-2-7b-chat.Q3_K_S.gguf"  

Atinoda commented 8 months ago

That looks like a different problem now - the fact that it got all the way to running the server means that the launcher arguments are being parsed, one way or another. You should have a Python stack trace in the Docker logs or the web UI itself when the model loading causes the crash. Did it crash instantly when you tried to load, or after a small delay? These are quite different events!

Steel-skull commented 8 months ago

It looks to be loading, but if it's crashing immediately, that might be a driver issue. Your output is not indicating an overall issue, though.

Also, what are your settings for the gguf?

You "should" be able to fully load it into vram as 3ks only needs 5.45gb.

Also, try EXL2, as it's a fully NVIDIA solution:

Turboderp/Llama2-7B-exl2:4.65bpw

Finally:

Go to console on the docker app:

nvidia-smi

Then

nvcc

Let me know the output of both.

ErrickVDW commented 8 months ago

Apologies I forgot to mention that I am using driver v545.29.06.

I get the following from your commands:

[Screenshot of nvidia-smi and nvcc output]

I also downloaded the model you provided and it also instantly crashed.

I haven't been able to get llama.cpp working with any settings; I just tried 10 layers and even brought the context length down as well.

Atinoda commented 8 months ago

I have two suggestions and one request. Try running a CPU-only model to see if that works, and try cranking the GPU layers to whatever the max of the slider is so it's definitely not splitting across CPU and GPU. Can you please post the stack trace from the crash when loading the model to GPU?

ErrickVDW commented 8 months ago

Hi @Atinoda. I was able to get a CPU model up and running pretty easily: le-vh/tinyllama-4bit-cpu ran fine when loaded with Transformers in 4-bit CPU mode. I did try llama.cpp with the CPU checkbox and it also crashed immediately with this:

{"log":"18:58:58-347156 INFO Loading \"le-vh_tinyllama-4bit-cpu\" \n","stream":"stdout","time":"2024-03-14T18:58:58.348707113Z"} {"log":"/scripts/docker-entrypoint.sh: line 69: 88 Illegal instruction \"${LAUNCHER[@]}\"\n","stream":"stderr","time":"2024-03-14T18:58:58.765619517Z"}

Although this could be expected behaviour with CPU models and llama.cpp.

I also tried loading the gguf model with the maximum gpu layers (256) like you suggested and it crashed with similar logs:

{"log":"19:02:01-154413 INFO Loading \"llama-2-7b-chat.Q3_K_S.gguf\" \n","stream":"stdout","time":"2024-03-14T19:02:01.155943934Z"} {"log":"/scripts/docker-entrypoint.sh: line 69: 88 Illegal instruction \"${LAUNCHER[@]}\"\n","stream":"stderr","time":"2024-03-14T19:02:01.291377937Z"}

I would also like to note that I was able to successfully load and use a GPTQ model using ExLlamav2_HF. These issues all seem to point to something with the llama.cpp loader.

ErrickVDW commented 8 months ago

Scratch that - I see the same error with ctransformers with max GPU layers:

{"log":"19:08:34-830622 INFO Loading \"llama-2-7b-chat.Q3_K_S.gguf\" \n","stream":"stdout","time":"2024-03-14T19:08:34.832135355Z"} {"log":"19:08:35-170332 INFO ctransformers weights detected: \n","stream":"stdout","time":"2024-03-14T19:08:35.171940055Z"} {"log":" \"models/llama-2-7b-chat.Q3_K_S.gguf\" \n","stream":"stdout","time":"2024-03-14T19:08:35.171971729Z"} {"log":"/scripts/docker-entrypoint.sh: line 69: 88 Illegal instruction \"${LAUNCHER[@]}\"\n","stream":"stderr","time":"2024-03-14T19:08:36.14733953Z"}

Atinoda commented 8 months ago

I'm starting to suspect an issue with either your system, or Unraid. However, given that other users are having success with Unraid it seems that the former is more likely. Another aspect of this is that your GPU is quite old, and perhaps llama-cpp is trying to call unsupported CUDA functions - a lot of the quantisation techniques use cutting edge functionality. In fact - thinking out loud - that is probably the issue.

Unfortunately what you shared is not a stack trace - it is just log outputs. You can see an example of a stack trace in the first post of #44 - it contains line numbers, modules, etc. If you can find and post that when it crashes, then I might be able to identify the problem. I may have a fix in mind if I can see what the issue is...

Atinoda commented 8 months ago

I checked, and the Tesla P40s I have for inferencing use the same Pascal architecture as your 1070. I haven't used them in a while but I will fire them up and test - if I encounter the same problem as you, then it could be an issue with old hardware.

ErrickVDW commented 8 months ago

Unfortunately I only know how to retrieve the logs from Docker, which are like the files I've provided in earlier posts. I've looked around and am not entirely sure how to get the stack trace. Could someone possibly advise on where I could find this in Unraid?

ErrickVDW commented 8 months ago

> I checked, and the Tesla P40s I have for inferencing use the same Pascal architecture as your 1070. I haven't used them in a while but I will fire them up and test - if I encounter the same problem as you, then it could be an issue with old hardware.

That's amazing! Thank you so much!

Atinoda commented 8 months ago

TheBloke/Llama-2-7B-Chat-GGUF:llama-2-7b-chat.Q3_K_S.gguf works fine on my P40 with n-gpu-layers maxed out to 256. It does not work with n-gpu-layers set to 10. Therefore I don't think it's a legacy hardware problem - leading us back to your system. I really need a stack trace to go any further troubleshooting this with you - perhaps ChatGPT / Gemini / Claude will help you get access to it? Or maybe another Unraid user can chip in.

ErrickVDW commented 8 months ago

Could you maybe tell me what specific settings you used to run it, aside from the GPU layers? It's entirely possible that I've just misconfigured something along the way as well.

Still digging around for the stack trace

Atinoda commented 8 months ago

Settings are all defaults except for n-gpu-layers:

[Screenshot of the model loader settings]

ErrickVDW commented 7 months ago

Hi there @Atinoda, my apologies for only replying to this now. I have found the issue that has been causing the container to crash. After running into the same issues with the OpenChat-Cuda Unraid app, I went through the issues discussed on its GitHub.

I have now discovered that it is actually my CPU causing this error. My machine has a Xeon E5-1660 v2, which is quite old and therefore only supports the original AVX instruction set, not the newer AVX2 instructions that the standard llama.cpp builds used for GGUF models require.

If I had properly described my machine specs we would've found this earlier, so again I apologize - and thank you all for the quick and helpful guidance.
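
(For anyone else who runs into this: you can check which AVX extensions the host CPU advertises with something like the following on the Docker host - if avx2 is missing from the output, the standard images are likely to crash with the same Illegal instruction error.)

grep -o -w -E 'avx|avx2' /proc/cpuinfo | sort -u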

Atinoda commented 7 months ago

Hi @ErrickVDW - thank you for sharing your results - I appreciate it and it might also help other people! I had originally considered AVX-only CPUs to probably be an edge case... but perhaps that's not how it is. Did you manage to get it working?

It is possible to build a version without the need for those instructions - I can help you with that, if you like. Seems like there is not a pre-baked version for nvidia without AVX2, but it might be possible to put one together. Another alternative is to spin up two containers - one CPU only without AVX2, and one normal nvidia container but avoiding models that use the CPU.

ErrickVDW commented 6 months ago

Hey @Atinoda,

Sorry for the silence on my end. After trying to figure out how to rebuild ooba for AVX and coming up short, I was hoping to ask for your guidance and assistance in creating an AVX-only NVIDIA container for these GGUF and GGML models.

Any help would be greatly appreciated

Atinoda commented 6 months ago

Hi @ErrickVDW, no problem - we all get busy! Glad that you're back to the LLMs. Oobabooga has released a set of requirements for no AVX2, and I have built an image for you to try out. Please try pulling the default-nvidia-avx-nightly tag and see if it works for you!
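
If you are pulling manually rather than via compose, something like this should fetch it (assuming the tag is published under the same atinoda/text-generation-webui repository as the other variants):

docker pull atinoda/text-generation-webui:default-nvidia-avx-nightly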

ErrickVDW commented 6 months ago

Wow @Atinoda , I've pulled the image and have instantly been able to run the GGUF model that I had the first issues with! I can even split between GPU and CPU! Very eager to get some new models going.

I can't thank you enough for breathing new life into my old hardware!

Atinoda commented 6 months ago

You're very welcome, and I'm glad that it worked for you! I'm a big supporter of keeping computers going - they've basically been crazy powerful for over a decade now and I've got plenty of mature gear still in operation myself. One of my inferencing rigs is P40-based, and although it struggles with newer quant methods - it's great value.

Thank you for testing it, and I'll probably add the variant to the project later - but I'll wait for an upstream release including requirements_noavx2.txt. In the meantime, if you would like to build the image yourself for updates - the following Dockerfile should work:

####################
### BUILD IMAGES ###
####################

# COMMON
FROM ubuntu:22.04 AS app_base
# Pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
    git vim build-essential python3-dev python3-venv python3-pip
# Instantiate venv and pre-activate
RUN pip3 install virtualenv
RUN virtualenv /venv
# Credit, Itamar Turner-Trauring: https://pythonspeed.com/articles/activate-virtualenv-dockerfile/
ENV VIRTUAL_ENV=/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip3 install --upgrade pip setuptools
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
### DEVELOPERS/ADVANCED USERS ###
# Clone oobabooga/text-generation-webui
RUN git clone https://github.com/oobabooga/text-generation-webui /src
# Use script to check out specific version
ARG VERSION_TAG
ENV VERSION_TAG=${VERSION_TAG}
RUN . /scripts/checkout_src_version.sh
# To use local source: comment out the git clone command then set the build arg `LCL_SRC_DIR`
#ARG LCL_SRC_DIR="text-generation-webui"
#COPY ${LCL_SRC_DIR} /src
#################################
# Copy source to app
RUN cp -ar /src /app

# NVIDIA-CUDA
# Base No AVX2
FROM app_base AS app_nvidia_avx
# Install pytorch for CUDA 12.1
RUN pip3 install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 \
    --index-url https://download.pytorch.org/whl/cu121 
# Install oobabooga/text-generation-webui
RUN ls /app
RUN pip3 install -r /app/requirements_noavx2.txt

# Extended No AVX2
FROM app_nvidia_avx AS app_nvidia_avx_x
# Install extensions
RUN chmod +x /scripts/build_extensions.sh && \
    . /scripts/build_extensions.sh

######################
### RUNTIME IMAGES ###
######################

# COMMON
FROM ubuntu:22.04 AS run_base
# Runtime pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
    python3-venv python3-dev git
# Copy app and src
COPY --from=app_base /app /app
COPY --from=app_base /src /src
# Instantiate venv and pre-activate
ENV VIRTUAL_ENV=/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Finalise app setup
WORKDIR /app
EXPOSE 7860
EXPOSE 5000
EXPOSE 5005
# Required for Python print statements to appear in logs
ENV PYTHONUNBUFFERED=1
# Force variant layers to sync cache by setting --build-arg BUILD_DATE
ARG BUILD_DATE
ENV BUILD_DATE=$BUILD_DATE
RUN echo "$BUILD_DATE" > /build_date.txt
ARG VERSION_TAG
ENV VERSION_TAG=$VERSION_TAG
RUN echo "$VERSION_TAG" > /version_tag.txt
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
# Run
ENTRYPOINT ["/scripts/docker-entrypoint.sh"]

# Extended without AVX2
FROM run_base AS default-nvidia-avx
# Copy venv
COPY --from=app_nvidia_avx_x $VIRTUAL_ENV $VIRTUAL_ENV
# Variant parameters
RUN echo "Nvidia Extended (No AVX2)" > /variant.txt
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py"]

and this is an example command you could use to build it:

docker build  \
  --build-arg BUILD_DATE="Now" \
  --build-arg VERSION_TAG="nightly" \
  --target default-nvidia-avx -t text-generation-webui:default-nvidia-avx \
  --progress=plain .
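
Once built, a run command along these lines should bring it up (a sketch only - it assumes the NVIDIA container toolkit is installed, reuses the ports from the compose file, and you will want to add volume mounts matching your own layout):

docker run --rm -it --gpus all \
  -p 7860:7860 -p 5000:5000 \
  -v "$(pwd)/config/models:/app/models" \
  -e EXTRA_LAUNCH_ARGS="--listen --verbose" \
  text-generation-webui:default-nvidia-avx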