DS4SD / docling

Get your docs ready for gen AI
https://ds4sd.github.io/docling
MIT License
684 stars 67 forks

State of GPU support #133

Open ViktorooReps opened 1 week ago

ViktorooReps commented 1 week ago

Hello Deep Search Team!

Thank you for this contribution to open source!

We are considering using your library to parse PDF files for LLM training, so we will potentially need to scale things up. Do you have any updates on GPU/multi-GPU support? Maybe some directions on where to start if we were to work on GPU support ourselves?

dolfim-ibm commented 1 week ago

Hi @ViktorooReps, thanks for reaching out.

We are planning some performance improvements in the coming days/weeks. If you are willing to contribute, it will certainly be appreciated.

Performance will be addressed in three ways:

  1. Faster PDF backend. We have a WIP branch for this in https://github.com/DS4SD/docling/pull/131
  2. (Re-)enable multi-threading for page batches. This was causing segfaults in some components that are not thread-safe.
  3. Make efficient use of GPUs for the models.

The initial thoughts about 3 are

leviataniac commented 1 week ago

We have run a RAG pipeline multiple times with the included examples here with Milvus ... even with scaling on NVIDIA L4 GPU machines, and it worked very well. It was a bit challenging to build the Docker image for that, but it seems to perform better. We did not collect proper performance metrics, but from our observations it is at least 2x faster. Looking forward to the v2 implementation, and thank you for the great job.

FYI, here is the start of the Dockerfile for getting things running in the Docker image. Make sure the NVIDIA drivers are properly exposed to Docker with GPU capabilities, so that the GPU is really used:

##########
FROM nvidia/cuda:12.6.1-runtime-ubuntu24.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-venv python3-dev python3-pip cron git curl && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN python3 -m venv /opt/venv

# Install PyTorch with GPU support (wheels built against CUDA 12.1)
RUN /opt/venv/bin/pip install --upgrade pip && \
    /opt/venv/bin/pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

.......
###########
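
To check from inside the container that the GPU is really visible to PyTorch, something like the following minimal sketch could be used. It is not part of docling or of the Dockerfile above; `describe_gpu` is a hypothetical helper name, and the only assumption is that PyTorch may or may not be installed in the environment where it runs.

```python
import importlib.util


def describe_gpu() -> dict:
    """Report whether PyTorch is installed and whether it can see a CUDA device.

    Hypothetical helper for illustration only; not a docling API.
    """
    info = {"torch_installed": False, "cuda_available": False, "devices": []}

    # Avoid crashing in environments where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    info["cuda_available"] = bool(torch.cuda.is_available())

    if info["cuda_available"]:
        # List the names of all CUDA devices PyTorch can use.
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    print(describe_gpu())
```

If `cuda_available` comes back `False` inside the container while the host does have a GPU, a likely cause is that the container was started without GPU access (for example, without `docker run --gpus all` and the NVIDIA Container Toolkit on the host).
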
dolfim-ibm commented 1 week ago

@leviataniac thanks for sharing this!