Hello.
I wanted to try out dsrag on a PDF that I have.
However, I ran into a few problems:
1) After installing dsrag with pip, I still had to manually install vertexai, google.generativeai, Pillow and pdf2image.
2) I wanted to use gpt-4o-mini without a reranker (since I don't have Cohere), so I looked at the example that uses manual text extraction.
However, the import used there, from dsrag.document_parsing import extract_text_from_pdf, fails because the dsrag.document_parsing module doesn't exist.
I have since found the extract_text_from_pdf function in the following module, which I was able to import:
from dsrag.dsparse.file_parsing.non_vlm_file_parsing import extract_text_from_pdf
3) Here is the code I tried to run:
#!/usr/bin/env python3
from dsrag.llm import OpenAIChatAPI
from dsrag.reranker import NoReranker
from dsrag.knowledge_base import KnowledgeBase
from dsrag.dsparse.file_parsing.non_vlm_file_parsing import extract_text_from_pdf
from dotenv import load_dotenv
load_dotenv()
llm = OpenAIChatAPI(model="gpt-4o-mini")
reranker = NoReranker()
file_path = "res-pages-58-to-68.pdf"
kb_id = "dsrag_test"
kb = KnowledgeBase(kb_id=kb_id, reranker=reranker, auto_context_model=llm)
text = extract_text_from_pdf(file_path)
kb.add_document(doc_id=file_path, text=text)
results = kb.query(["List all institutional stakeholders"])
for segment in results:
    print(segment)
However, this produces the following error:
Traceback (most recent call last):
  File "/home/nikola/Programming/dsrag/dsrag_test.py", line 23, in <module>
    kb.add_document(doc_id=file_path, text=text)
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/knowledge_base.py", line 255, in add_document
    sections, chunks = parse_and_chunk(
                       ^^^^^^^^^^^^^^^^
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/dsparse/main.py", line 109, in parse_and_chunk
    sections, chunks = parse_and_chunk_no_vlm(
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/dsparse/main.py", line 225, in parse_and_chunk_no_vlm
    sections, document_lines = get_sections_from_str(
                               ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py", line 259, in get_sections_from_str
    document_lines = str_to_lines(document)
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikola/Programming/dsrag/.venv/lib/python3.12/site-packages/dsrag/dsparse/sectioning_and_chunking/semantic_sectioning.py", line 189, in str_to_lines
    lines = document.split("\n")
            ^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'split'
Writing
text, _ = extract_text_from_pdf(file_path)
instead of
text = extract_text_from_pdf(file_path)
made the error go away.
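Judging only from the traceback and the unpacking fix, extract_text_from_pdf seems to return a tuple whose first element is the text. A minimal sketch of a guard that accepts either shape is below; as_text is a hypothetical helper of my own, not part of dsrag, and the tuple layout is an assumption based on the error message.

```python
def as_text(result):
    """Return the text from either a plain string or a tuple whose first item is a string.

    Hypothetical helper: the tuple layout is assumed from the AttributeError above,
    not taken from dsrag's documentation.
    """
    if isinstance(result, str):
        return result
    if isinstance(result, tuple) and result and isinstance(result[0], str):
        return result[0]
    raise TypeError(f"expected str or (str, ...) tuple, got {type(result).__name__}")
```

With this, text = as_text(extract_text_from_pdf(file_path)) would pass a string to kb.add_document whether the function returns a bare string or a tuple.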
However, the original problem about the missing module still stands.
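Until the example is corrected, one workaround is to try both import paths in order. The sketch below is only an illustration: import_first is a hypothetical helper, and the two module paths in the usage comment are the ones mentioned in this report, not a statement about where the function will live in future releases.

```python
import importlib


def import_first(candidates):
    """Return the first attribute that imports successfully.

    candidates is a list of (module_path, attribute_name) pairs, tried in order.
    """
    errors = []
    for module_path, attr in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed and move on to the next one.
            errors.append(f"{module_path}.{attr}: {exc}")
    raise ImportError("none of the candidates could be imported:\n" + "\n".join(errors))


# Usage with the two paths from this report (requires dsrag to be installed):
# extract_text_from_pdf = import_first([
#     ("dsrag.document_parsing", "extract_text_from_pdf"),
#     ("dsrag.dsparse.file_parsing.non_vlm_file_parsing", "extract_text_from_pdf"),
# ])
```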