Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0

[BUG] Deployment on Red Hat OpenShift #418

Closed micuentadecasa closed 1 month ago

micuentadecasa commented 1 month ago

Description

I tried to install the Docker image on OpenShift, but it fails with an error when the container starts.

Reproduction steps

1. Go to Workloads
2. Add a container image
3. The installation fails with an error

Logs

Traceback (most recent call last):
  File "/app/app.py", line 3, in <module>
    from theflow.settings import settings as flowsettings
  File "/usr/local/lib/python3.10/site-packages/theflow/__init__.py", line 1, in <module>
    from .base import Function, Node, Param, SessionFunction, unset
  File "/usr/local/lib/python3.10/site-packages/theflow/base.py", line 30, in <module>
    from .config import Config, ConfigProperty, DefaultConfig
  File "/usr/local/lib/python3.10/site-packages/theflow/config.py", line 16, in <module>
    class DefaultConfig:
  File "/usr/local/lib/python3.10/site-packages/theflow/config.py", line 36, in DefaultConfig
    default_backend = settings.BASE_BACKEND
  File "/usr/local/lib/python3.10/site-packages/theflow/settings/__init__.py", line 64, in __getattr__
    self.load_settings()
  File "/usr/local/lib/python3.10/site-packages/theflow/settings/__init__.py", line 51, in load_settings
    spec.loader.exec_module(module)
  File "/app/flowsettings.py", line 34, in <module>
    KH_APP_DATA_DIR.mkdir(parents=True, exist_ok=True)
  File "/usr/local/lib/python3.10/pathlib.py", line 1175, in mkdir
    self._accessor.mkdir(self, mode)
PermissionError: [Errno 13] Permission denied: '/app/ktem_app_data'

Browsers

No response

OS

No response

Additional information

No response

micuentadecasa commented 1 month ago

I found the cause of the error. OpenShift runs the container/pod as a non-root user, so the Dockerfile needs to be modified to give the user's group permission to access the folders.
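To illustrate why the group permissions matter, here is a simplified sketch of the POSIX permission check. It assumes the container runs with an arbitrary high UID that is a member of the root group (GID 0), which is how OpenShift's default restricted policy behaves. The function name `can_create_in_dir` and the example UID are made up for illustration, and a real kernel also considers ACLs and capabilities.

```python
import stat
from typing import Iterable


def can_create_in_dir(mode: int, owner_uid: int, owner_gid: int,
                      uid: int, gids: Iterable[int]) -> bool:
    """Simplified POSIX check: may a process create entries in a directory?

    Creating a subdirectory needs write and search (execute) permission on
    the parent. Only one permission class (owner, group, or other) applies.
    """
    if uid == 0:  # root bypasses the mode bits
        return True
    if uid == owner_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif owner_gid in set(gids):
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (mode & needed) == needed


# Hypothetical arbitrary UID of the kind OpenShift assigns; it belongs to GID 0.
ARBITRARY_UID = 1000680000

# /app owned by root:root with mode 755: the group cannot write.
print(can_create_in_dir(0o755, 0, 0, ARBITRARY_UID, [0]))  # False
# The same directory with mode 775 (group write added): the group can write.
print(can_create_in_dir(0o775, 0, 0, ARBITRARY_UID, [0]))  # True
```

Under these assumptions, the `mkdir` in the traceback fails in the first case and succeeds in the second, which is why making the directories group-writable for GID 0 is enough.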

micuentadecasa commented 3 weeks ago

I will post it on Monday.

On Sat, 2 Nov 2024 at 15:03, scheckley @.***> wrote:

If you make any progress on a rootless container deployment, it would be interesting to see the Dockerfile. I'm working on the same problem here.

scheckley commented 3 weeks ago

thanks :)

I ended up with this, which may not be the neatest, but it seems to work:

FROM python:3.10-slim AS base_image

# Use Python virtual environment for dependencies to avoid system-wide installs
ENV VENV_PATH=/app/venv
RUN python3 -m venv $VENV_PATH

# Set up PATH for the virtual environment
ENV PATH="$VENV_PATH/bin:$PATH"

# Common dependencies with non-root considerations
RUN apt-get update -qqy && \
    apt-get install -y --no-install-recommends \
    ssh \
    git \
    gcc \
    g++ \
    poppler-utils \
    libpoppler-dev \
    unzip \
    curl \
    cargo \
    tesseract-ocr \
    tesseract-ocr-jpn \
    libsm6 \
    libxext6 \
    libreoffice \
    ffmpeg \
    libmagic-dev

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8
# TARGETARCH is a build argument; declare it with ARG so ${TARGETARCH} is not empty
ARG TARGETARCH
ENV TARGETARCH=${TARGETARCH}

# Create working directory with correct permissions
WORKDIR /app

RUN chmod -R 755 /app

# Set up NLTK data directory for cache
ENV NLTK_DATA=/app/nltk_data
RUN mkdir -p /app/nltk_data && chmod -R 775 /app/nltk_data

FROM base_image AS dev

# Download pdfjs
COPY scripts/download_pdfjs.sh /app/scripts/download_pdfjs.sh
RUN chmod +x /app/scripts/download_pdfjs.sh
ENV PDFJS_PREBUILT_DIR="/app/libs/ktem/ktem/assets/prebuilt/pdfjs-dist"
RUN bash scripts/download_pdfjs.sh $PDFJS_PREBUILT_DIR

# Copy contents
COPY . /app
COPY .env.example /app/.env

RUN pip install --upgrade pip

# Install pip packages
RUN pip install --no-cache-dir wheel && \
    pip install --no-cache-dir -e "libs/kotaemon" && \
    pip install --no-cache-dir graphrag nano-graphrag future python-decouple theflow==0.8.6 && \
    pip install --no-cache-dir -e "libs/ktem" && \
    pip install --no-cache-dir "pdfservices-sdk@git+https://github.com/niallcm/pdfservices-python-sdk.git@bump-and-unfreeze-requirements"

# Install torch and additional packages
RUN pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu && \
    pip install --no-cache-dir -e "libs/kotaemon[adv]" && \
    pip install --no-cache-dir unstructured[all-docs]

# Download NLTK packages explicitly
RUN pip install --no-cache-dir nltk && \
    python -c "import nltk; nltk.download('punkt', download_dir=nltk.data.path[0]); nltk.download('averaged_perceptron_tagger', download_dir=nltk.data.path[0])"

# Verify theflow installation
RUN pip freeze | grep theflow && \
    python -c "import theflow; print(theflow.__file__)"

RUN chmod -R 775 /app

RUN pip uninstall -y hnswlib
RUN pip uninstall -y chroma-hnswlib
RUN pip install --no-cache-dir chroma-hnswlib

# Expose the app's default port
EXPOSE 7860

# Run as a non-root user; OpenShift replaces this with an arbitrary UID in the root group (GID 0)
USER 1001

CMD ["python", "app.py", "--host", "0.0.0.0", "--port", "7860"]

I had a problem with hnswlib which I think stems from having multiple versions installed during the build.
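A quick way to check for that kind of clash is to list every installed distribution whose name mentions hnswlib. This is only a diagnostic sketch using the standard library's `importlib.metadata`; the helper name `find_distributions` is made up here.

```python
from importlib import metadata


def find_distributions(keyword: str) -> list:
    """Return sorted (name, version) pairs for installed distributions
    whose name contains `keyword` (case-insensitive)."""
    found = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or ""
        if keyword.lower() in name.lower():
            version = dist.metadata.get("Version") or "unknown"
            found.append((name, version))
    return sorted(found)


# Print one line per matching distribution, e.g. hnswlib and chroma-hnswlib
# side by side if both are installed.
for name, version in find_distributions("hnswlib"):
    print(f"{name}=={version}")
```

Running this inside the container, with the same Python that runs the app, shows whether more than one hnswlib-related package ended up in the environment.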

I edited app.py to bind to 0.0.0.0:

import os

from theflow.settings import settings as flowsettings

KH_APP_DATA_DIR = getattr(flowsettings, "KH_APP_DATA_DIR", ".")
GRADIO_TEMP_DIR = os.getenv("GRADIO_TEMP_DIR", None)
# override GRADIO_TEMP_DIR if it's not set
if GRADIO_TEMP_DIR is None:
    GRADIO_TEMP_DIR = os.path.join(KH_APP_DATA_DIR, "gradio_tmp")
    os.environ["GRADIO_TEMP_DIR"] = GRADIO_TEMP_DIR

from ktem.main import App  # noqa

app = App()
demo = app.make()
demo.queue().launch(
    favicon_path=app._favicon,
    inbrowser=True,
    allowed_paths=[
        "libs/ktem/ktem/assets",
        GRADIO_TEMP_DIR,
    ],
    server_name="0.0.0.0",
)

This has deployed on an on-premise OpenShift cluster without any admin privileges.
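The CMD in the Dockerfile above passes `--host` and `--port`, while the app.py shown here hard-codes `server_name` and does not appear to read those flags. A minimal sketch of parsing them, assuming the flag names from the CMD line; `parse_server_args` is a hypothetical helper, not part of kotaemon.

```python
import argparse
from typing import List, Optional


def parse_server_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse --host and --port, defaulting to the values used in the CMD line."""
    parser = argparse.ArgumentParser(description="Server bind options (sketch)")
    parser.add_argument("--host", default="0.0.0.0",
                        help="interface to bind; 0.0.0.0 listens on all interfaces")
    parser.add_argument("--port", type=int, default=7860,
                        help="TCP port to listen on")
    # parse_known_args ignores any flags this sketch does not define
    args, _unknown = parser.parse_known_args(argv)
    return args


# Explicit argument list for the demo; pass None to read sys.argv instead.
args = parse_server_args(["--host", "0.0.0.0", "--port", "7860"])
print(args.host, args.port)
```

The parsed values could then be handed to Gradio as `launch(server_name=args.host, server_port=args.port)`, both of which are existing `launch()` parameters.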

micuentadecasa commented 3 weeks ago

This is mine:

# Lite version
FROM python:3.10-slim AS lite

# ---------------------------------------------------------------------------

# Common dependencies
RUN apt-get update -qqy && \
    apt-get install -y --no-install-recommends \
    ssh \
    git \
    gcc \
    g++ \
    poppler-utils \
    libpoppler-dev \
    unzip \
    curl \
    cargo

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8

# Create working directory
WORKDIR /app

# Adjust permissions for OpenShift's random user ID
RUN mkdir -p /app/libs && \
    mkdir -p /app/scripts && \
    chmod -R g+rwX /app && \
    chown -R 1001:0 /app

# Download pdfjs
COPY scripts/download_pdfjs.sh /app/scripts/download_pdfjs.sh
RUN chmod +x /app/scripts/download_pdfjs.sh
ENV PDFJS_PREBUILT_DIR="/app/libs/ktem/ktem/assets/prebuilt/pdfjs-dist"
RUN bash scripts/download_pdfjs.sh $PDFJS_PREBUILT_DIR

# Copy contents
COPY . /app

# Adjust permissions after copying files
RUN chmod -R g+rwX /app && chown -R 1001:0 /app

# Install pip packages
RUN --mount=type=ssh \
    --mount=type=cache,target=/root/.cache/pip \
    pip install -e "libs/kotaemon" \
    && pip install -e "libs/ktem" \
    && pip install graphrag future \
    && pip install "pdfservices-sdk@git+https://github.com/niallcm/pdfservices-python-sdk.git@bump-and-unfreeze-requirements"

# Clean up
RUN apt-get autoremove \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf ~/.cache

# Set permissions for Python packages installed in /usr/local/lib
RUN chmod -R g+rwX /usr/local/lib/python3.10/site-packages/

CMD ["python", "app.py"]

# Full version
FROM lite AS full

# Additional dependencies for full version
RUN apt-get update -qqy && \
    apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-jpn \
    libsm6 \
    libxext6 \
    libreoffice \
    ffmpeg \
    libmagic-dev

# Install torch and torchvision for unstructured
RUN --mount=type=ssh \
    --mount=type=cache,target=/root/.cache/pip \
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Copy contents
COPY . /app

# Create required directories and set environment variables
RUN mkdir -p /app/nltk_data && chmod -R g+rwX /app/nltk_data
ENV NLTK_DATA=/app/nltk_data

RUN mkdir -p /app/matplotlib && chmod -R g+rwX /app/matplotlib
ENV MPLCONFIGDIR=/app/matplotlib

RUN mkdir -p /app/fontconfig && chmod -R g+rwX /app/fontconfig
ENV XDG_CACHE_HOME=/app/fontconfig

# Install additional pip packages
RUN --mount=type=ssh \
    --mount=type=cache,target=/root/.cache/pip \
    pip install -e "libs/kotaemon[adv]" \
    && pip install unstructured[all-docs]

# Clean up
RUN apt-get autoremove \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf ~/.cache

# Download nltk packages as required for unstructured
RUN python -c "from unstructured.nlp.tokenize import _download_nltk_packages_if_not_present; _download_nltk_packages_if_not_present()"

# Run the gradio app
CMD ["python", "app.py"]

scheckley commented 1 week ago

Did you happen to mount persistent storage and settings between builds? I tried to mount a PVC at ~/app/ktem_app_data/ using a symbolic link, but it doesn't seem to persist if the container is rebuilt.
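One possible way to avoid the symbolic link, sketched under assumptions: let the data directory come from an environment variable that points at the PVC mount path. The variable name `KH_DATA_DIR_OVERRIDE` and the helper `resolve_data_dir` are hypothetical and not part of kotaemon; only the `mkdir(parents=True, exist_ok=True)` call mirrors the flowsettings.py line from the traceback.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical variable name, chosen for this sketch only.
ENV_VAR = "KH_DATA_DIR_OVERRIDE"


def resolve_data_dir(default: Path, env_var: str = ENV_VAR) -> Path:
    """Use the directory named by `env_var` if it is set, else `default`,
    and create it the same way flowsettings.py does."""
    data_dir = Path(os.environ.get(env_var, default))
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


# Demo: a temporary directory stands in for the PVC mount point.
with tempfile.TemporaryDirectory() as mount_point:
    os.environ[ENV_VAR] = os.path.join(mount_point, "ktem_app_data")
    try:
        data_dir = resolve_data_dir(Path("/app/ktem_app_data"))
        print(data_dir, data_dir.is_dir())
    finally:
        del os.environ[ENV_VAR]
```

With something like this, the deployment would set the variable to the volume's mount path, so the data location no longer depends on anything baked into the image.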