Atinoda / text-generation-webui-docker

Docker variants of oobabooga's text-generation-webui, including pre-built images.
GNU Affero General Public License v3.0

Running without a GPU #13

Closed - Sharpz7 closed this 1 year ago

Sharpz7 commented 1 year ago

Hey,

I wanted to check: is it possible to run this container without a GPU?

Thanks,

Atinoda commented 1 year ago

You sure can, and there are some instructions in #9 that should help you set it up - basically, just comment out all the GPU parts in the docker-compose.yml (or don't include --gpus all if you're running without compose).

You'll need to be a patient man though - it's slow as molasses without a GPU!
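
For the non-compose route, a minimal sketch of a CPU-only docker run - the point being the absence of --gpus all. The port and models mount mirror the compose file shared later in this thread; the other volume mounts are omitted here for brevity:

# No --gpus all flag, so the container runs without GPU access
docker run -it --rm \
  -p 7860:7860 \
  -v "$(pwd)/config/models:/app/models" \
  atinoda/text-generation-webui:default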

Sharpz7 commented 1 year ago

This didn't seem to work in my environment - it errors out saying it can't find a GPU when a model is loaded. I'll try some things and get back to you.

Sharpz7 commented 1 year ago

I managed to get it working by using this guide: https://github.com/oobabooga/text-generation-webui/blob/main/docs/Low-VRAM-guide.md

And making this change to the command in my docker-compose.yml (full file below for reference):

command: ["python", "/app/server.py", "--auto-devices"]

version: "3"
services:
  text-generation-webui-docker:
    image: atinoda/text-generation-webui:default # Specify variant as the :tag
    container_name: text-generation-webui
    environment:
      - EXTRA_LAUNCH_ARGS="--listen --verbose" # Custom launch args (e.g., --model MODEL_NAME)
#      - BUILD_EXTENSIONS_LIVE="silero_tts whisper_stt" # Install named extensions during every container launch. THIS WILL SIGNIFICANTLY SLOW LAUNCH TIME.
    ports:
      - 7860:7860  # Default web port
#      - 5000:5000  # Default API port
#      - 5005:5005  # Default streaming port
#      - 5001:5001  # Default OpenAI API extension port
    volumes:
      - ./config/loras:/app/loras
      - ./config/models:/app/models
      - ./config/presets:/app/presets
      - ./config/prompts:/app/prompts
      - ./config/softprompts:/app/softprompts
      - ./config/training:/app/training
#      - ./config/extensions:/app/extensions  # Persist all extensions
#      - ./config/extensions/silero_tts:/app/extensions/silero_tts  # Persist a single extension
    logging:
      driver:  json-file
      options:
        max-file: "3"   # maximum number of rotated log files
        max-size: '10m'
    command: ["python", "/app/server.py", "--auto-devices"]
    # deploy:
    #     resources:
    #       reservations:
    #         devices:
    #           - driver: nvidia
    #             device_ids: ['0']
    #             capabilities: [gpu]

Atinoda commented 1 year ago

Thanks for sharing your fix and confirming that it works with CPU only on your system. Enjoy your LLM-ing, and make sure your CPU cooler is tuned up!

PS. You can append --auto-devices to the EXTRA_LAUNCH_ARGS environment variable instead of editing the CMD.
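
For example, the environment block of the compose file above would then read (keeping the original --listen --verbose flags):

    environment:
      - EXTRA_LAUNCH_ARGS="--listen --verbose --auto-devices" # Custom launch args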

Sharpz7 commented 1 year ago

I also realised I was being silly - you can configure it from the settings:

https://drive.google.com/uc?id=1UEjDNVtbBh4oAdb4k_WJHPYpdpSXI2Kj

Thanks for the quick response. Looking forward to doing my LLM testing with this UI :))

If you would be interested in having a Helm chart in this repo as well, I'd be happy to contribute one.

globavi commented 1 year ago

> You sure can, and there are some instructions in #9 that should help you set it up - basically, just comment out all the GPU parts in the docker-compose.yml (or don't include --gpus all if you're running without compose).
>
> You'll need to be a patient man though - it's slow as molasses without a GPU!

Hi @Atinoda, does "running without a GPU" also assume using the provided Dockerfile? IMHO, the CUDA base image used there cannot be scheduled on a machine without a GPU.

Atinoda commented 1 year ago

Hi @globavi - since this discussion, a llama-cpu image has become available (see #16). It still uses the CUDA base image, but it should work fine (I was able to run it on an Intel laptop that has only an iGPU). Can you please try it out and let me know if you run into any problems?
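
If you're using the compose file from earlier in this thread, selecting that variant is just a matter of changing the :tag (assuming the variant tag is named llama-cpu, matching the image naming shown above):

    services:
      text-generation-webui-docker:
        image: atinoda/text-generation-webui:llama-cpu # CPU-oriented variant specified as the :tag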

globavi commented 1 year ago

Hi @Atinoda,

I could start the app with the new image (I adapted a few things since I don't use Docker Compose, but Azure infrastructure), but after downloading a GGML model, the load_model process says:

2023-08-22 08:19:23 INFO:Loading TheBloke_Llama-2-7B-Chat-GGML...
CUDA error 35 at ggml-cuda.cu:4883: CUDA driver version is insufficient for CUDA runtime version
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
Stream closed EOF for customer-dev/claims-sle-textgen-ui-bash-684c9488c6-g4rxk (textgen-webui)

Okynawah commented 10 months ago

Hey,

I was wondering if iGPU inference is a thing? I'm not sure whether there would be any gains over the CPU, but I'm curious, and I can't find a way to make it work.