amaiya / onprem

A tool for running on-premises large language models with non-public data
https://amaiya.github.io/onprem
Apache License 2.0

WSL 2 and Docker Instructions #42

Closed dwvisser closed 4 months ago

dwvisser commented 10 months ago

FYI @amaiya

This works. I need to write up a "model card" worthy of display on Docker Hub that shows, e.g., how to map local directories into the container for onprem_data and the SQLite database, how to map ports, etc. I will be working on that today.

amaiya commented 10 months ago

@dwvisser With respect to GPU support, please see this reference that may be useful.

dwvisser commented 10 months ago

> @dwvisser With respect to GPU support, please see this reference that may be useful.

Awesome! To really test this, I am looking at it on my personal laptop (a 7-year-old Dell that used to be my work laptop). It has caused me to dive into some GPU arcana. Apparently, every time the Nvidia drivers on a Fedora system are updated (the installation originally was its own arcane process), it is necessary to go through a manual process to re-sign the associated kernel module so Secure Boot will load it (the signing steps are sketched after the output below). Finally, though, I have succeeded at that first validation step from the article you just shared:

➜  nvidia-smi
Tue Nov 14 10:09:45 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce 940MX           Off | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8              N/A / 200W |      2MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3055      G   /usr/bin/gnome-shell                          0MiB |
+---------------------------------------------------------------------------------------+
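
For anyone else facing the Secure Boot hurdle, the signing dance is roughly the following (a sketch, assuming a Machine Owner Key pair; the module path is illustrative and depends on how the driver was packaged):

# One-time: generate a signing key pair and enroll the public half with shim.
# mokutil prompts for a password that must be re-entered at the next boot.
openssl req -new -x509 -newkey rsa:2048 -nodes -days 36500 \
  -subj "/CN=Local NVIDIA module signing/" \
  -keyout MOK.priv -outform DER -out MOK.der
sudo mokutil --import MOK.der

# After each driver update: re-sign the rebuilt module with the enrolled key.
sudo /usr/src/kernels/$(uname -r)/scripts/sign-file sha256 MOK.priv MOK.der \
  /usr/lib/modules/$(uname -r)/extra/nvidia/nvidia.ko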
dwvisser commented 9 months ago

So...

My Fedora machine was never going to work, unless I went down the road of getting a VM up and running with GPU pass-through, a task in its own right.

My work laptop's Nvidia GPU, despite being newer, doesn't support CUDA.

I finally managed with some trial and error to run a local onprem model on my son's gaming machine (Windows 11 using WSL2-Ubuntu 22.04, GeForce RTX 3060). I recorded the steps I think were needed. I'm going to make sure I am able to reproduce from my notes in the morning.

dwvisser commented 9 months ago

See https://github.com/dwvisser/onprem/blob/experiment-docker/wsl2-install.md for latest progress

dwvisser commented 9 months ago

So, I got it working in a Docker container, i.e., the language model loads into GPU memory, with BLAS=1 showing in the stderr "log" output. Steps that work:

  1. The already documented steps of getting Nvidia Toolkit, Docker, & Nvidia Container Toolkit correctly installed.
  2. Build a local Docker image tagged cuda_simple using https://github.com/abetlen/llama-cpp-python/blob/main/docker/cuda_simple/Dockerfile (came from commit https://github.com/abetlen/llama-cpp-python/commit/22917989003c5e67623d54ab45affa1e0e475410)
  3. For just onprem without Streamlit, build the following Dockerfile (I tagged it as onprem_cuda_simple):
FROM cuda_simple
RUN python3 -m pip install onprem
CMD python3
  4. Run it with the following (I haven't checked how important the --cap-add and -e options are, but am assuming they are very important):
sudo docker run --rm -it --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 \
  -v /home/dale/onprem_data:/root/onprem_data onprem_cuda_simple
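
For reference, the full build sequence might look like this (a sketch; the clone location and image tags are assumptions):

git clone https://github.com/abetlen/llama-cpp-python.git
sudo docker build -t cuda_simple llama-cpp-python/docker/cuda_simple

# Build the onprem layer from the three-line Dockerfile above, read from stdin:
sudo docker build -t onprem_cuda_simple - <<'EOF'
FROM cuda_simple
RUN python3 -m pip install onprem
CMD python3
EOF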
dwvisser commented 9 months ago

This is ready for test and review now

@mzientek would you have time to try out the new instructions and script I placed in docker/?

dwvisser commented 9 months ago

Good resource for when I'm fine-tuning the documentation for ingesting documents:

https://hugotkk.github.io/posts/docker-volume-mounting-permissions-and-ownership-explained/

dwvisser commented 9 months ago

I have confirmed that when onprem_data and its files are all owned by the current user, the container accesses them just fine. I had issues the other day when some of the files were owned by root:root.
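
The quick fix, if anyone else hits this, is to reset ownership on the host before launching (assuming the data directory is ~/onprem_data):

sudo chown -R "$USER:$USER" ~/onprem_data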

Next step (Monday): Have a standard location in the container for ingested documents, and provide it as a mapped volume. The docs should state any needed changes to webapp.yml as well as how to ingest documents relative to the volume mapped in the container launch command.
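
Something along these lines is what I have in mind (a sketch only; the /root/documents mount point is an assumption, not the final layout):

sudo docker run --rm -it --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 \
  -v ~/onprem_data:/root/onprem_data \
  -v ~/my_documents:/root/documents \
  -p 8000:8000 onprem_cuda_simple onprem --port=8000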

dwvisser commented 9 months ago

I've got this working pretty well at this point. I've put some additional "automagic" in onprem/__init__.py so that the workaround for the sqlite3 version issue is obvious. I was just running an onprem Streamlit instance out of a container on MCTL. I would like to make some final tests in the morning before I call this ready for possible merge.

dwvisser commented 9 months ago

After some additional testing this morning, I am happy with this now. If I uninstall pysqlite3-binary from the container, and try to start the web app, it gives a useful error message about what is needed.

@amaiya Please consider for merge.

amaiya commented 9 months ago

Nice! I'll check it out and probably reach back out with questions. This looks great!

Also, thanks for adding this for seamlessness:

# Some older nvidia/cuda base images force this need on us.
# Try to appease chromadb's minimum SQLite version expectation before it is loaded.
import sqlite3
if sqlite3.sqlite_version_info < (3, 35, 0):
    import importlib
    try:
        importlib.import_module("pysqlite3")
    except ImportError:
        raise ImportError(
            "Please install pysqlite3-binary: pip install pysqlite3-binary"
        )
    # Swap the drop-in replacement in for the stdlib module so that
    # chromadb's own `import sqlite3` picks up the newer library.
    from sys import modules
    modules["sqlite3"] = modules.pop("pysqlite3")
dwvisser commented 9 months ago

Iterating towards perfection…

It turns out that both the major and minor CUDA versions matter when building the container image. I was assuming that just the major version had to match the host's driver, but the condition is slightly more subtle than that:

$ sudo docker run --rm -it --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0     -v ~/onprem_data:/root/onprem_data -p 8000:8000 onprem_cuda:11.8.0 onprem --port=8000
[sudo] password for dvisser:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.8, please update your driver to a newer version, or use an earlier cuda container: unknown.

I.e., I'll have to make the build script a little smarter, with a lookup table to ensure that it is pulling the latest known base image where CUDA major version matches, and CUDA minor version is <= host version. The simplest way would be to just use the x.0 versions, but I'd like it to pull the matching version if possible.
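
Something like this selection logic is what I have in mind (a sketch; the tag list is illustrative, not the actual lookup table):

# Parse the host's maximum supported CUDA version from nvidia-smi's banner.
host_cuda=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
host_major=${host_cuda%%.*}

# Known nvidia/cuda base image tags, newest first (illustrative list).
known_tags="12.2.0 12.1.1 12.0.1 11.8.0 11.7.1"

# Pick the newest tag whose major version matches the host's and whose
# major.minor does not exceed the host's CUDA version.
for tag in $known_tags; do
  tag_mm=${tag%.*}
  if [ "${tag%%.*}" = "$host_major" ] && \
     [ "$(printf '%s\n' "$tag_mm" "$host_cuda" | sort -V | head -n1)" = "$tag_mm" ]; then
    echo "nvidia/cuda:${tag}-devel-ubuntu22.04"
    break
  fi
done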

dwvisser commented 9 months ago

I had a few distractions today. I will do the above-described work tomorrow.

dwvisser commented 9 months ago

The automatic base image selection based on all version info is ready for testing by somebody else. It worked great on MCTL for me. I will also test it on my available home machines.

dwvisser commented 9 months ago

> I will also test it on my available home machines.

Other than a very slow (possibly throttled at the server) download of the largest base image layer, the test on a home machine went flawlessly, with a CUDA-enhanced onprem web app successfully served out of a container and answering questions I asked in the web UI.

dwvisser commented 9 months ago

Testing out document corpus query capability from container. Hit this error message:

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.) return torch._C._cuda_getDeviceCount() > 0

dwvisser commented 9 months ago

Indeed, these are the NVIDIA CUDA-related Python packages installed in the container:

root@b527f52643ed:/# pip list | grep cuda
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105

For context, this is our 11.7.1 container that is running.

This seems very fixable. PyTorch needs to be installed before the other packages, from an index matching the container's CUDA version, so pip doesn't pull in the default CUDA 12 wheels.

For CUDA 11.8:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

And this should work for CUDA 11.7:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

For CUDA 12.x, the default index works, which is effectively what we are doing at present for all images:

pip3 install torch torchvision torchaudio

For CPU installs:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
dwvisser commented 9 months ago

The above can be unified with an INDEX_URL build argument, given that pip's default index URL is https://pypi.org/simple.
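
A sketch of the idea (the ARG name and base image are assumptions, not the final build script):

sudo docker build -t onprem_cuda:11.7.1 \
  --build-arg INDEX_URL=https://download.pytorch.org/whl/cu117 - <<'EOF'
FROM nvidia/cuda:11.7.1-devel-ubuntu22.04
ARG INDEX_URL=https://pypi.org/simple
RUN apt-get update && apt-get install -y python3-pip
# Install torch first, from the index matching the image's CUDA version,
# so pip does not pull in the default CUDA 12 wheels.
RUN pip3 install torch torchvision torchaudio --index-url ${INDEX_URL}
# (llama-cpp-python's CUDA build flags are omitted here for brevity.)
RUN pip3 install onprem
EOF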

dwvisser commented 9 months ago

@amaiya This branch is now fully tested with the Streamlit app running from a container, along with instructions that help with ingesting documents and then querying them.

As for Docker Hub, I think it is a matter of peeling off a portion of README.md and posting it as the description for the images. Another possibility I thought of, but haven't looked into, is using the GitHub Container Registry to host the images. I'll look into what that would take.
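
For reference, publishing to GHCR would look roughly like this (a sketch; the image name is an assumption, and the login needs a personal access token with the write:packages scope):

echo "$GITHUB_PAT" | docker login ghcr.io -u dwvisser --password-stdin
docker tag onprem_cuda:11.8.0 ghcr.io/amaiya/onprem_cuda:11.8.0
docker push ghcr.io/amaiya/onprem_cuda:11.8.0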