Hi @ErrickVDW - GGUF is for CPU inference, so you probably don't want that if you are planning on using your GPU. There should definitely be logs - unfortunately, without them or a stack trace there's nothing I can offer in the way of help. However, the cause of the crash should definitely be recorded somewhere - try to find it! I suggest troubleshooting it interactively with the help of ChatGPT / Gemini / Claude / etc. to try and track down the issue. Perhaps somebody else who has run into the same problem may be able to offer some insight!
Thanks so much for the quick reply! I see - I was under the impression that GGML and now GGUF models allow for CPU+GPU inference, letting you partially load the model onto the GPU without it having to be entirely loaded into VRAM.
Unfortunately it looks as though the container crashes before anything is written to logs. I have also used tools such as netdata to hopefully get some insight but can't seem to find anything.
Sorry about that - you are indeed correct! I don't do much with hybrid model loading - I usually stick to either CPU or GPU. Another user got the container working on Unraid and you can read about it in #27 - it might be helpful even though it's ROCm. There should be a log from the docker daemon or other service that is managing the container itself, even if the container does not produce any logs.
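For example, something along these lines will usually surface it - a sketch assuming a standard Docker setup, since the exact log locations vary by host and the container name is taken from the template further down:
docker ps -a                                                      # find the exited container and its exit code
docker inspect -f '{{.State.ExitCode}} {{.State.Error}}' text-generation-webui
journalctl -u docker.service --since "10 min ago"                 # Docker daemon log on systemd hosts
dmesg | tail -n 50                                                # kernel messages, e.g. OOM kills or illegal-instruction traps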
Amazing, thanks for the reference! I will dig through the issue you mentioned and see if I can spot anything that may help me. I will also look into the docker logging you mentioned and see if I can find anything from the crashes. Will update as soon as I can!
Thank you again
Hi @Atinoda, apologies for the late reply. I'm not sure how much help it is but I was able to find an additional error in the docker logs that was not visible in the log terminal:
{"log":"/scripts/docker-entrypoint.sh: line 69: 90 Illegal instruction \"${LAUNCHER[@]}\"\n","stream":"stderr","time":"2024-03-13T17:15:10.659958918Z"}
Fresh, default container and went straight to loading the model using llama.cpp.
Please let me know if this is of any help container.log
Good job on finding the logs! That does help. Please try removing the quotes around your EXTRA_LAUNCH_ARGS parameter, and instead escape the spaces with a backslash - i.e., --listen\ --verbose\ --api
Thanks! I completely redid the config to make sure everything was correct. Here it is with the updated args.
Unfortunately it is still crashing, but I am seeing some new errors, which feels like a good sign! container.log
Hi @Atinoda, apologies for the late reply. I'm not sure how much help it is but I was able to find an additional error in the docker logs that was not visible in the log terminal:
{"log":"/scripts/docker-entrypoint.sh: line 69: 90 Illegal instruction "${LAUNCHER[@]}"\n","stream":"stderr","time":"2024-03-13T17:15:10.659958918Z"}
Fresh, default container and went straight to loading the model using llama.cpp.
Please let me know if this is of any help container.log
I have exactly the same problem:
text-generation-webui | 18:28:00-234940 INFO Loading "llama-2-7b-chat.Q2_K.gguf"
text-generation-webui | /scripts/docker-entrypoint.sh: line 69: 128 Illegal instruction (core dumped) "${LAUNCHER[@]}"
text-generation-webui exited with code 132
user@test:~/text-generation-webui-docker$
I have removed all the folders on the host and let the container recreate them. Freshly downloaded models and not seeing the majority of those errors anymore. Just these:
/scripts/docker-entrypoint.sh: line 69: 79 Illegal instruction "${LAUNCHER[@]}"
/scripts/docker-entrypoint.sh: line 69: 82 Illegal instruction "${LAUNCHER[@]}"
Good job on finding the logs! That does help. Please try removing the quotes around your EXTRA_LAUNCH_ARGS parameter, and instead escape the spaces with a backslash - i.e., --listen\ --verbose\ --api
Unfortunately your recommendations didn't help :(
- EXTRA_LAUNCH_ARGS=--listen\ --verbose # Custom launch args (e.g., --model MODEL_NAME)
@wififun - thank you for also reporting your issue. If we can fix it for both of you then hopefully it's a good fix!
The root problem is with how the script is failing to parse the launch arguments. I have been meaning to revisit that bit of the script because it has caused problems elsewhere... for now, it should be possible to get it up and running. Are you also including the # Custom launch args (e.g., --model MODEL_NAME) part? That comment must not be included for Unraid.
@Steel-skull has posted a working Unraid template in issue #5 - does that help at all?
The other thing to try is to leave EXTRA_LAUNCH_ARGS blank - can the container launch?
EDIT: Fixed word salad.
I have a fully updated version of the template that "should" work - it works for me, but everyone's config is different. (I'll post it soon.)
I've noticed it doesn't always pull a version of the CUDA toolkit that matches the Unraid server when loading (causing dependency hell and nothing working), but as long as you keep your Unraid server at driver v545.29.06 it looks to work fine.
Also, if you have a GGML or GPTQ model that uses more VRAM than you have, it will crash the container. Exl2 doesn't seem to have this issue.
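To see whether VRAM is actually running out while a model loads, something like this on the host can help - a sketch, assuming nvidia-smi is available on the Unraid host:
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv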
@Atinoda
Here is the updated version:
<?xml version="1.0"?>
<Container version="2">
<Name>text-generation-webui</Name>
<Repository>atinoda/text-generation-webui:latest</Repository>
<Registry/>
<Network>bridge</Network>
<MyIP/>
<Shell>sh</Shell>
<Privileged>false</Privileged>
<Support/>
<Project/>
<Overview/>
<Category/>
<WebUI>http://[IP]:[PORT:7860]</WebUI>
<TemplateURL/>
<Icon/>
<ExtraParams>--runtime=nvidia</ExtraParams>
<PostArgs/>
<CPUset/>
<DateInstalled>1710364177</DateInstalled>
<DonateText/>
<DonateLink/>
<Requires/>
<Config Name="WebUI" Target="7860" Default="7860" Mode="tcp" Description="" Type="Port" Display="always" Required="true" Mask="false">7860</Config>
<Config Name="Open AI API" Target="5000" Default="5000" Mode="tcp" Description="" Type="Port" Display="always" Required="true" Mask="false">5000</Config>
<Config Name="Characters" Target="/app/characters" Default="./config/characters" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/characters/</Config>
<Config Name="Loras" Target="/app/loras" Default="./config/loras" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/loras/</Config>
<Config Name="Models" Target="/app/models" Default="./config/models" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/models/</Config>
<Config Name="Presets" Target="/app/presets" Default="./config/presets" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/presets/</Config>
<Config Name="Prompts" Target="/app/prompts" Default="./config/prompts" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/prompts/</Config>
<Config Name="Training" Target="/app/training" Default="./config/training" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/training/</Config>
<Config Name="Extensions" Target="/app/extensions" Default="./config/extensions" Mode="rw" Description="" Type="Path" Display="always" Required="false" Mask="false">/mnt/user/text-generation-webui-docker/config/extensions/</Config>
<Config Name="EXTRA_LAUNCH_ARGS" Target="EXTRA_LAUNCH_ARGS" Default="" Mode="" Description="" Type="Variable" Display="always" Required="false" Mask="false">"--listen --verbose --api"</Config>
<Config Name="NVIDIA_VISIBLE_DEVICES" Target="NVIDIA_VISIBLE_DEVICES" Default="all" Mode="" Description="" Type="Variable" Display="always" Required="false" Mask="false">all</Config>
<Config Name="NVIDIA_DRIVER_CAPABILITIES" Target="NVIDIA_DRIVER_CAPABILITIES" Default="all" Mode="" Description="" Type="Variable" Display="always" Required="false" Mask="false">all</Config>
</Container>
This is the exact version I use on a 2x 3090ti server, so it should work with multiple cards.
Use driver v545.29.06
Hi there @Steel-skull,
I have created a new container using that exact config, only changing the API port to 5005 because mine is currently occupied and have still run into the same issue.
Downloaded TheBloke/Llama-2-7B-Chat-GGUF, set llama.cpp to 10 layers and it unfortunately crashed immediately again.
ls: cannot access '/app/training/datasets': No such file or directory
cp: cannot create regular file '/app/training/datasets/': Not a directory
chown: cannot access '/app/training/datasets': No such file or directory
ls: cannot access '/app/training/formats': No such file or directory
cp: target '/app/training/formats/' is not a directory
chown: cannot access '/app/training/formats': No such file or directory
100%|██████████| 2.95G /2.95G 15.8MiB/s/s
*** Initialising config for: 'characters' ***
/scripts/docker-entrypoint.sh: line 69: 134 Illegal instruction "${LAUNCHER[@]}"
*** Initialising config for: 'loras' ***
*** Initialising config for: 'models' ***
*** Initialising config for: 'presets' ***
*** Initialising config for: 'prompts' ***
*** Initialising config for: 'training/datasets' ***
*** Initialising config for: 'training/formats' ***
*** Initialising extension: 'Training_PRO' ***
*** Initialising extension: 'character_bias' ***
*** Initialising extension: 'coqui_tts' ***
*** Initialising extension: 'example' ***
*** Initialising extension: 'gallery' ***
*** Initialising extension: 'google_translate' ***
*** Initialising extension: 'long_replies' ***
*** Initialising extension: 'multimodal' ***
*** Initialising extension: 'ngrok' ***
*** Initialising extension: 'openai' ***
*** Initialising extension: 'perplexity_colors' ***
*** Initialising extension: 'sd_api_pictures' ***
*** Initialising extension: 'send_pictures' ***
*** Initialising extension: 'silero_tts' ***
*** Initialising extension: 'superbooga' ***
*** Initialising extension: 'superboogav2' ***
*** Initialising extension: 'whisper_stt' ***
=== Running text-generation-webui variant: 'Nvidia Extended' snapshot-2024-03-10 ===
=== (This version is 18 commits behind origin main) ===
=== Image build date: 2024-03-11 22:19:32 ===
09:18:29-157098 INFO Starting Text generation web UI
09:18:29-162116 WARNING
You are potentially exposing the web UI to the entire
internet without any access password.
You can create one with the "--gradio-auth" flag like
this:
--gradio-auth username:password
Make sure to replace username:password with your own.
09:18:29-174735 INFO Loading the extension "openai"
09:18:29-461495 INFO OpenAI-compatible API URL:
http://0.0.0.0:5000
09:18:29-463583 INFO Loading the extension "gallery"
Running on local URL: http://0.0.0.0:7860
Downloading the model to models
09:26:23-076717 INFO Loading "llama-2-7b-chat.Q3_K_S.gguf"
That looks like a different problem now - the fact that it got all the way to running the server means the launcher arguments are being parsed, one way or the other. You should have a Python stack trace in the docker logs or the web UI itself when the model loading causes the crash. Did it crash instantly when you tried to load, or after a small delay? These are quite different events!
Looks to be loading, but if it's crashing immediately, that might be a driver issue. Your output is not indicating an overall issue, though.
Also, what are your settings for the GGUF?
You "should" be able to fully load it into VRAM, as Q3_K_S only needs 5.45 GB.
Also, try Exl2, as it's a fully Nvidia solution:
Turboderp/Llama2-7B-exl2:4.65bpw
Finally, go to the console on the docker app and run:
nvidia-smi
then
nvcc
Let me know the answers.
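If it's easier than the Unraid console button, the same checks can be run from the host shell - a minimal sketch, assuming the container is named text-generation-webui as in the template above:
docker exec -it text-generation-webui nvidia-smi        # confirms the GPU and driver are visible inside the container
docker exec -it text-generation-webui nvcc --version    # reports the CUDA toolkit version, if nvcc is on the image's PATH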
Apologies I forgot to mention that I am using driver v545.29.06.
I get the following from your commands:
I also downloaded the model you provided and it also instantly crashed.
I haven't been able to get llama.cpp working with any settings, but I just tried 10 layers and even brought down the context length as well.
I have two suggestions and one request. Try running a CPU-only model to see if that works, and try cranking the GPU layers to whatever the max of the slider is so it's definitely not splitting across CPU and GPU. Can you please post the stack trace from the crash when loading the model to GPU?
Hi @Atinoda . I was able to get a CPU model up and running pretty easily. le-vh/tinyllama-4bit-cpu ran fine if I loaded with Transformers in 4 bit CPU mode. I did try llama.cpp with the cpu checkbox and it also crashed immediately with this:
{"log":"18:58:58-347156 INFO Loading \"le-vh_tinyllama-4bit-cpu\" \n","stream":"stdout","time":"2024-03-14T18:58:58.348707113Z"} {"log":"/scripts/docker-entrypoint.sh: line 69: 88 Illegal instruction \"${LAUNCHER[@]}\"\n","stream":"stderr","time":"2024-03-14T18:58:58.765619517Z"}
Although this could be expected behaviour with cpu models and llama.cpp.
I also tried loading the gguf model with the maximum gpu layers (256) like you suggested and it crashed with similar logs:
{"log":"19:02:01-154413 INFO Loading \"llama-2-7b-chat.Q3_K_S.gguf\" \n","stream":"stdout","time":"2024-03-14T19:02:01.155943934Z"} {"log":"/scripts/docker-entrypoint.sh: line 69: 88 Illegal instruction \"${LAUNCHER[@]}\"\n","stream":"stderr","time":"2024-03-14T19:02:01.291377937Z"}
I would also like to note that I was able to successfully load and use a GPTQ model using ExLlamav2_HF. These issues all seem to point to something with the llama.cpp loader.
Scratch that - I see the same error with ctransformers with max GPU layers:
{"log":"19:08:34-830622 INFO Loading \"llama-2-7b-chat.Q3_K_S.gguf\" \n","stream":"stdout","time":"2024-03-14T19:08:34.832135355Z"} {"log":"19:08:35-170332 INFO ctransformers weights detected: \n","stream":"stdout","time":"2024-03-14T19:08:35.171940055Z"} {"log":" \"models/llama-2-7b-chat.Q3_K_S.gguf\" \n","stream":"stdout","time":"2024-03-14T19:08:35.171971729Z"} {"log":"/scripts/docker-entrypoint.sh: line 69: 88 Illegal instruction \"${LAUNCHER[@]}\"\n","stream":"stderr","time":"2024-03-14T19:08:36.14733953Z"}
I'm starting to suspect an issue with either your system or Unraid. However, given that other users are having success with Unraid, it seems that the former is more likely. Another aspect of this is that your GPU is quite old, and perhaps llama-cpp is trying to call unsupported CUDA functions - a lot of the quantisation techniques use cutting-edge functionality. In fact - thinking out loud - that is probably the issue.
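A quick way to check what the card itself supports is to query its compute capability - a sketch, assuming a reasonably recent nvidia-smi that understands the compute_cap field:
nvidia-smi --query-gpu=name,compute_cap --format=csv
Both the GTX 1070 and the Tesla P40 are Pascal parts, so they should report the same compute capability.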
Unfortunately what you shared is not a stack trace - it is just log outputs. You can see an example of a stack trace in the first post of #44 - it contains line numbers, modules, etc. If you can find and post that when it crashes, then I might be able to identify the problem. I may have a fix in mind if I can see what the issue is...
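One way to make sure nothing is lost when the container dies - a sketch, with the container name assumed from the template above - is to follow its combined output into a file while reproducing the crash:
docker logs -f text-generation-webui 2>&1 | tee ~/tgw-crash.log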
I checked, and the Tesla P40s I have for inferencing use the same Pascal architecture as your 1070. I haven't used them in a while but I will fire them up and test - if I encounter the same problem as you, then it could be an issue with old hardware.
Unfortunately I only know how to retrieve the logs from docker, which are like the files I've provided in earlier posts. I've looked around and I'm not entirely sure how to get the stack trace. Could someone possibly advise on where I could get this in Unraid?
I checked, and the Tesla P40s I have for inferencing use the same Pascal architecture as your 1070. I haven't used them in a while but I will fire them up and test - if I encounter the same problem as you, then it could be an issue with old hardware.
That's amazing! Thank you so much!
TheBloke/Llama-2-7B-Chat-GGUF:llama-2-7b-chat.Q3_K_S.gguf works fine on my P40 with n-gpu-layers maxed out to 256. It does not work with n-gpu-layers set to 10. Therefore I don't think it's a legacy hardware problem - leading us back to your system. I really need a stack trace to go any further troubleshooting this with you - perhaps ChatGPT / Gemini / Claude will help you get access to it? Or maybe another Unraid user can chip in.
Could you maybe tell me what specific settings you used to run it, aside from the GPU layers? It's entirely possible that I've just misconfigured something along the way as well.
Still digging around for the stack trace.
Settings are all defaults except for n-gpu-layers:
Hi there @Atinoda, my apologies for only replying to this now. I have found the issue that has been causing the container to crash. After running into the same issues with the OpenChat-Cude Unraid app, I went through the issues discussed on GitHub.
I have now discovered that it is actually my CPU causing this error. My machine has a Xeon E5-1660 v2, which is quite old and therefore only supports the AVX instruction set, not the newer instructions required by the prebuilt GGUF loaders.
If I had properly described my machine specs we would've found this earlier, so again I apologize, and thank you all for the quick and helpful guidance.
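For anyone else hitting the same "Illegal instruction" crash, the CPU flags can be checked directly - a sketch using standard Linux tools on the host:
grep -o -m1 -w avx2 /proc/cpuinfo   # prints 'avx2' if the CPU supports it; no output means no AVX2
grep -o -m1 -w avx  /proc/cpuinfo   # the older AVX flag, which this Xeon does have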
Hi @ErrickVDW - thank you for sharing your results - I appreciate it and it might also help other people! I had originally considered AVX-only CPUs to probably be an edge case... but perhaps that's not how it is. Did you manage to get it working?
It is possible to build a version without the need for those instructions - I can help you with that, if you like. Seems like there is not a pre-baked version for nvidia without AVX2, but it might be possible to put one together. Another alternative is to spin up two containers - one CPU only without AVX2, and one normal nvidia container but avoiding models that use the CPU.
Hey @Atinoda,
Sorry for the silence on my end. After trying to figure out how to rebuild ooba for AVX and coming up short, I was hoping to ask for your guidance and assistance in creating an AVX-only nvidia container for these GGUF and GGML models.
Any help would be greatly appreciated
Hi @ErrickVDW, no problem - we all get busy! Glad that you're back to the LLMs. Oobabooga has released a set of requirements for no AVX2 - I have built an image for you to try out. Please try pulling the default-nvidia-avx-nightly tag and see if it works for you!
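Assuming the image is published under the same Docker Hub repository as in the template above, pulling it looks like:
docker pull atinoda/text-generation-webui:default-nvidia-avx-nightly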
Wow @Atinoda , I've pulled the image and have instantly been able to run the GGUF model that I had the first issues with! I can even split between GPU and CPU! Very eager to get some new models going.
I can't thank you enough for breathing new life into my old hardware!
You're very welcome, and I'm glad that it worked for you! I'm a big supporter of keeping computers going - they've basically been crazy powerful for over a decade now and I've got plenty of mature gear still in operation myself. One of my inferencing rigs is P40-based, and although it struggles with newer quant methods - it's great value.
Thank you for testing it, and I'll probably add the variant to the project later - but I'll wait for an upstream release including requirements_noavx2.txt. In the meantime, if you would like to build the image yourself for updates, the following Dockerfile should work:
####################
### BUILD IMAGES ###
####################
# COMMON
FROM ubuntu:22.04 AS app_base
# Pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
git vim build-essential python3-dev python3-venv python3-pip
# Instantiate venv and pre-activate
RUN pip3 install virtualenv
RUN virtualenv /venv
# Credit, Itamar Turner-Trauring: https://pythonspeed.com/articles/activate-virtualenv-dockerfile/
ENV VIRTUAL_ENV=/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip3 install --upgrade pip setuptools
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
### DEVELOPERS/ADVANCED USERS ###
# Clone oobabooga/text-generation-webui
RUN git clone https://github.com/oobabooga/text-generation-webui /src
# Use script to check out specific version
ARG VERSION_TAG
ENV VERSION_TAG=${VERSION_TAG}
RUN . /scripts/checkout_src_version.sh
# To use local source: comment out the git clone command then set the build arg `LCL_SRC_DIR`
#ARG LCL_SRC_DIR="text-generation-webui"
#COPY ${LCL_SRC_DIR} /src
#################################
# Copy source to app
RUN cp -ar /src /app
# NVIDIA-CUDA
# Base No AVX2
FROM app_base AS app_nvidia_avx
# Install pytorch for CUDA 12.1
RUN pip3 install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 \
--index-url https://download.pytorch.org/whl/cu121
# Install oobabooga/text-generation-webui
RUN ls /app
RUN pip3 install -r /app/requirements_noavx2.txt
# Extended No AVX2
FROM app_nvidia_avx AS app_nvidia_avx_x
# Install extensions
RUN chmod +x /scripts/build_extensions.sh && \
. /scripts/build_extensions.sh
######################
### RUNTIME IMAGES ###
######################
# COMMON
FROM ubuntu:22.04 AS run_base
# Runtime pre-reqs
RUN apt-get update && apt-get install --no-install-recommends -y \
python3-venv python3-dev git
# Copy app and src
COPY --from=app_base /app /app
COPY --from=app_base /src /src
# Instantiate venv and pre-activate
ENV VIRTUAL_ENV=/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Finalise app setup
WORKDIR /app
EXPOSE 7860
EXPOSE 5000
EXPOSE 5005
# Required for Python print statements to appear in logs
ENV PYTHONUNBUFFERED=1
# Force variant layers to sync cache by setting --build-arg BUILD_DATE
ARG BUILD_DATE
ENV BUILD_DATE=$BUILD_DATE
RUN echo "$BUILD_DATE" > /build_date.txt
ARG VERSION_TAG
ENV VERSION_TAG=$VERSION_TAG
RUN echo "$VERSION_TAG" > /version_tag.txt
# Copy and enable all scripts
COPY ./scripts /scripts
RUN chmod +x /scripts/*
# Run
ENTRYPOINT ["/scripts/docker-entrypoint.sh"]
# Extended without AVX2
FROM run_base AS default-nvidia-avx
# Copy venv
COPY --from=app_nvidia_avx_x $VIRTUAL_ENV $VIRTUAL_ENV
# Variant parameters
RUN echo "Nvidia Extended (No AVX2)" > /variant.txt
ENV EXTRA_LAUNCH_ARGS=""
CMD ["python3", "/app/server.py"]
and this is an example command you could use to build it:
docker build \
--build-arg BUILD_DATE="Now" \
--build-arg VERSION_TAG="nightly" \
--target default-nvidia-avx -t text-generation-webui:default-nvidia-avx \
--progress=plain .
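And a minimal run example for the locally built image - the ports and models path are assumptions taken from the template earlier in the thread, so adjust them to your own setup:
docker run --rm -it --gpus all \
  -p 7860:7860 -p 5000:5000 \
  -v "$(pwd)/config/models:/app/models" \
  text-generation-webui:default-nvidia-avx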
I am currently running the container on unraid. I have used the docker compose file as well as maually creating the container and changing storage mounts. I am able to download the models from hf and when I select the GGUF model from the drop down it selects the llama.cpp transformer. I have tried many different variations of settings but no combination works. This is also true of ctransformers. As soon as I click load the container crashes with no logs. I am passing in my gtx 1070 with 8gb of VRAM and it is visible from within the container by running nvidia-smi. I have tried the DEFAULT, NVIDIA and even snapshots from 2023. I am not sure what I am doing wrong