D-Star-AI / dsRAG

High-performance retrieval engine for unstructured data
MIT License
1.02k stars, 66 forks

Import "dsrag.document_parsing" from the README example couldn't be resolved #73

Open kubni opened 3 days ago

kubni commented 3 days ago

Hello. I wanted to try out dsrag on a PDF that I have, but I ran into a couple of problems:

1) After installing dsrag with pip, I still had to manually install vertexai, google.generativeai, Pillow and pdf2image.

2) I wanted to use gpt-4o-mini without a reranker (since I don't have Cohere), so I followed the README example with manual text extraction. However, the import from that example, from dsrag.document_parsing import extract_text_from_pdf, fails because the dsrag.document_parsing module doesn't exist. I have since found the extract_text_from_pdf function in the following module, which I was able to import: from dsrag.dsparse.file_parsing.non_vlm_file_parsing import extract_text_from_pdf

3) Here is the code I tried to run:

#!/usr/bin/env python3

from dsrag.llm import OpenAIChatAPI
from dsrag.reranker import NoReranker
from dsrag.knowledge_base import KnowledgeBase

from dsrag.dsparse.file_parsing.non_vlm_file_parsing import extract_text_from_pdf

from dotenv import load_dotenv

load_dotenv()

llm = OpenAIChatAPI(model="gpt-4o-mini")
reranker = NoReranker()

file_path = "res-pages-58-to-68.pdf"
kb_id = "dsrag_test"

kb = KnowledgeBase(kb_id=kb_id, reranker=reranker, auto_context_model=llm)
text = extract_text_from_pdf(file_path)
kb.add_document(doc_id=file_path, text=text)
results = kb.query(["List all institutional stakeholders"])
for segment in results:
    print(segment)

However this produces the following error:

Traceback (most recent call last):
  File "/home/nikola/Programming/dsrag/dsrag_test.py", line 23, in <module>
    kb.add_document(doc_id=file_path, text=text)
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/knowledge_base.py", line 255, in add_document
    sections, chunks = parse_and_chunk(
                       ^^^^^^^^^^^^^^^^
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/dsparse/main.py", line 109, in parse_and_chunk
    sections, chunks = parse_and_chunk_no_vlm(
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/dsparse/main.py", line 225, in parse_and_chunk_no_vlm
    sections, document_lines = get_sections_from_str(
                               ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py", line 259, in get_sections_from_str
    document_lines = str_to_lines(document)
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py", line 189, in str_to_lines
    lines = document.split("\n")
            ^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'split'

kubni commented 3 days ago

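Writing text, _ = extract_text_from_pdf(file_path) instead of text = extract_text_from_pdf(file_path) made the error go away, so the function apparently returns a tuple whose first element is the text rather than a plain string. However, the original problem about the missing module still stands.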
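
For anyone hitting the same AttributeError, here is a minimal sketch of a defensive way to handle the return value. It does not import dsrag: fake_extract_text_from_pdf is a hypothetical stand-in, and the assumption that the real function returns a tuple with the text as its first element comes only from the traceback and the workaround above, not from the library's documentation. The as_text helper is likewise just an illustration.

```python
from typing import Any


def fake_extract_text_from_pdf(file_path: str) -> tuple[str, list[str]]:
    """Hypothetical stand-in for dsrag's extract_text_from_pdf.

    Assumption: the real function returns a tuple whose first element is the
    full text. What the second element holds is not confirmed in this thread;
    a list of per-page strings is used here purely for illustration.
    """
    pages = ["page one text", "page two text"]
    return "\n".join(pages), pages


def as_text(result: Any) -> str:
    """Return the text whether the extractor gave back a str or a tuple."""
    if isinstance(result, str):
        return result
    if isinstance(result, tuple) and result and isinstance(result[0], str):
        return result[0]
    raise TypeError(f"unexpected extractor result: {type(result).__name__}")


# Normalise the result before passing it on as the document text.
text = as_text(fake_extract_text_from_pdf("res-pages-58-to-68.pdf"))

# This is the call that failed on a tuple in the traceback above.
lines = text.split("\n")
```

With the real library, the same idea would be text = as_text(extract_text_from_pdf(file_path)) before calling kb.add_document, which keeps the script working whether the function returns a string or a tuple.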