SciPhi-AI / R2R

The most advanced Retrieval-Augmented Generation (RAG) system, containerized and RESTful
https://r2r-docs.sciphi.ai/
MIT License

File ingestion gets stuck for a long time #620

Open viraptor opened 4 months ago

viraptor commented 4 months ago

Describe the bug

I'm using the following config:

{
    "app": {
        "max_file_size_in_mb": 100
    },
    "embedding": {
        "provider": "ollama",
        "base_model": "nomic-embed-text",
        "base_dimension": 768,
        "batch_size": 32
    },
    "completions": {
        "provider": "litellm",
        "model": "ollama/dolphin-llama3:8b-v2.9-q6_K"
    },
    "ingestion":{
        "excluded_parsers": [
            "gif", "jpeg", "jpg", "png", "svg", "mp3", "mp4"
        ]
    },
    "vector_database": {
        "provider": "pgvector",
        "user": "r2r",
        "password": "r2r",
        "host": "127.0.0.1",
        "db_name": "r2r",
        "port": 5432,
        "vecs_collection": "r2rnomic"
    }
}
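As a sanity check before ingesting, a config like the one above can be validated with a few lines of stdlib Python. This is a minimal sketch, not R2R's own validation logic; the field names are taken from the config in this report, and the checks themselves are assumptions about what a well-formed config should contain:

```python
import json

def check_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems found in an R2R-style config dict."""
    problems = []
    emb = cfg.get("embedding", {})
    # base_dimension must match the embedding model's output size (768 for nomic-embed-text).
    if not isinstance(emb.get("base_dimension"), int):
        problems.append("embedding.base_dimension must be an integer")
    if emb.get("provider") == "ollama" and not emb.get("base_model"):
        problems.append("embedding.base_model is required when provider is ollama")
    vdb = cfg.get("vector_database", {})
    if vdb.get("provider") == "pgvector" and not vdb.get("db_name"):
        problems.append("vector_database.db_name is required when provider is pgvector")
    return problems

if __name__ == "__main__":
    with open("r2r_config.json") as f:
        cfg = json.load(f)
    for p in check_config(cfg):
        print("config problem:", p)
```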

When I ran r2r ingest-files on the EC2 documentation, the app appeared stuck for a long time without doing any work (all CPUs idle, no Ollama requests visible in the logs). After more than 2 minutes of waiting, it processed the file in ~30 seconds, during which I saw a lot of Ollama embedding requests come through.

Using commit 2f6f18c66858b4cf15d29accd19d7ef8016e98d4

To Reproduce

r2r --config-path ... ingest-files ec2-ug.pdf



emrgnt-cmplxty commented 4 months ago

Hi viraptor,

Perhaps it took a while to perform OCR on your document. How much compute / memory is available to your Docker container?

The PDF you shared is rather large, so it is advisable to split it into smaller pieces to allow for more parallelization (e.g. so that OCR doesn't become a bottleneck).

We are working on building a more efficient / performant OCR pipeline, but that will take a few weeks to months.
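The splitting suggested above can be sketched as follows. The `page_ranges` helper is pure stdlib; the commented-out section shows how the ranges could be written out with a PDF library such as pypdf, which is an assumption on my part and not part of R2R:

```python
def page_ranges(total_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """Split [0, total_pages) into consecutive half-open (start, end) ranges."""
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]

# With a PDF library such as pypdf (hypothetical here -- not part of R2R),
# each range could be written out as its own file and ingested separately,
# so OCR and embedding can run on smaller units:
#
#   from pypdf import PdfReader, PdfWriter
#   reader = PdfReader("ec2-ug.pdf")
#   for i, (start, end) in enumerate(page_ranges(len(reader.pages), 100)):
#       writer = PdfWriter()
#       for page in reader.pages[start:end]:
#           writer.add_page(page)
#       with open(f"ec2-ug-part{i}.pdf", "wb") as f:
#           writer.write(f)
```

Each resulting part can then be passed to r2r ingest-files on its own, so a stall in one chunk no longer blocks the whole document.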