NVIDIA / GenerativeAIExamples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Apache License 2.0
2.43k stars, 520 forks

ERROR:example:Failed to ingest document due to exception Unable to get page count. #196

Open. grische opened this issue 2 months ago

grische commented 2 months ago

I followed the instructions in the README and started the example in GenerativeAIExamples/RAG/examples/basic_rag/langchain.

The docker logs of chain-server:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:faiss.loader:Loading faiss with AVX2 support.
INFO:faiss.loader:Successfully loaded faiss with AVX2 support.
INFO:RAG.src.chain_server.utils:Using nvidia-ai-endpoints as model engine and nvidia/nv-embedqa-e5-v5 and model for embeddings
INFO:RAG.src.chain_server.utils:Using embedding model nvidia/nv-embedqa-e5-v5 hosted at api catalog
INFO:RAG.src.chain_server.utils:Using milvus collection: nvidia_api_catalog
INFO:RAG.src.chain_server.utils:Vector store created and saved.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO:     172.18.0.6:48730 - "GET /documents HTTP/1.1" 200 OK
INFO:     172.18.0.6:40180 - "GET /documents HTTP/1.1" 200 OK
INFO:     172.18.0.6:60014 - "GET /documents HTTP/1.1" 200 OK
INFO:     172.18.0.6:60800 - "GET /documents HTTP/1.1" 200 OK
INFO:pikepdf._core:pikepdf C++ to Python logger bridge initialized
ERROR:example:Failed to ingest document due to exception Unable to get page count. Is poppler installed and in PATH?
ERROR:RAG.src.chain_server.server:Error from POST /documents endpoint. Ingestion of file: /tmp/gradio/b3131f976d42f2c5b2cab5027eeaabec73658e1423259694a7a7d107b65be0bd/test.pdf failed with error: Failed to upload document. Please upload an unstructured text document.
INFO:     172.18.0.6:60810 - "GET /documents HTTP/1.1" 200 OK
INFO:     172.18.0.6:60804 - "POST /documents HTTP/1.1" 500 Internal Server Error
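
The first ERROR line above asks whether poppler is installed and in PATH. A minimal sketch for answering that question in the same environment as chain-server, assuming the PDF loader shells out to poppler's `pdfinfo` and `pdftoppm` command-line tools (an assumption; the log does not say which library raised the error). `missing_poppler_tools` is a hypothetical helper and is not part of the repository:

```python
import shutil
from typing import Callable, List, Optional

# Poppler command-line tools that PDF-to-image converters commonly call.
# Assumption: these are the binaries the failing loader depends on.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the poppler tools that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("pdfinfo and pdftoppm were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests. Run as a script, it reports which of the two binaries, if any, are missing. An empty result only rules out a PATH problem and does not mean the PDF itself is supported.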

shubhadeepd commented 1 month ago

Thanks for reporting this issue! Are you trying to ingest files with embedded images?

grische commented 1 month ago

Yes, the PDF has images embedded: kb-terraform.pdf

shubhadeepd commented 1 month ago

> Yes, the PDF has images embedded: kb-terraform.pdf

The basic RAG example does not support ingesting PDFs with embedded images. Please consider using https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RAG/examples/advanced_rag/multimodal_rag, which does support them.

grische commented 1 month ago

Would it be possible to strip the images instead of throwing an error? Or at least show a clearer error message?
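
To illustrate the second request, here is a rough sketch of an error path that keeps the underlying cause instead of replacing it with the generic "Please upload an unstructured text document" text seen in the log. Everything in it is hypothetical: `IngestionError`, `explain_failure`, `ingest_with_clear_errors`, and the `ingest` callable are illustrative names, not the chain-server's actual code.

```python
from typing import Callable


class IngestionError(Exception):
    """Raised when a document cannot be ingested; carries a user-facing message."""


def explain_failure(filename: str, exc: Exception) -> str:
    """Build a message that keeps the original cause visible to the user."""
    cause = str(exc) or exc.__class__.__name__
    message = f"Failed to ingest {filename}: {cause}"
    if filename.lower().endswith(".pdf"):
        # Hint taken from the maintainer's reply in this thread.
        message += (
            " Note: the basic RAG example does not support PDFs with"
            " embedded images; see the multimodal_rag example."
        )
    return message


def ingest_with_clear_errors(filename: str, ingest: Callable[[str], None]) -> None:
    """Call the (hypothetical) ingest function and re-raise with context."""
    try:
        ingest(filename)
    except Exception as exc:
        # "from exc" preserves the original traceback for the server logs.
        raise IngestionError(explain_failure(filename, exc)) from exc
```

With a wrapper along these lines, the message returned from POST /documents would include the poppler text from the log as well as the pointer to the multimodal example, so the user would not have to read the container logs to find the real cause.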