fabiomatricardi / cdQnA

repository for documents and studies about closed domain question and answering with LLM
46 stars, 24 forks

Issue on PDF Loader - Embeddings: #1

Open neeewwww opened 1 year ago

neeewwww commented 1 year ago

Hello, I finally found a PDF Q&A with a free alternative to OpenAI. I'm testing the code, but I'm 200% illiterate and dumb in coding. I'm trying to build a Gradio/Streamlit app that answers questions on a specific topic, basically putting it together like Lego and using ChatGPT to help me out. Maybe this app will give me visibility on the market, since I got fired in January (my market isn't coding-related).

Can you help me figure out this error? Thanks!

WARNING:unstructured:detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with another strategy.
WARNING:unstructured:Falling back to partitioning with ocr_only.

FileNotFoundError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout)
    567             env["LD_LIBRARY_PATH"] = poppler_path + ":" + env.get("LD_LIBRARY_PATH", "")
--> 568         proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
    569

10 frames

FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

PDFInfoNotInstalledError                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout)
    592
    593     except OSError:
--> 594         raise PDFInfoNotInstalledError(
    595             "Unable to get page count. Is poppler installed and in PATH?"
    596         )

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
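
For anyone hitting the same error: `pdfinfo` is one of the Poppler command-line tools, and pdf2image calls it as an external program, so a pip package alone will not provide it. A minimal sketch for checking whether the tools are visible on PATH follows. The helper name `find_missing_tools` is made up for illustration, and the `poppler-utils` hint assumes a Debian/Ubuntu-style environment such as Colab.

```python
import shutil


def find_missing_tools(tools=("pdfinfo", "pdftoppm")):
    """Return the names in `tools` that cannot be found on PATH.

    `find_missing_tools` is a hypothetical helper, not part of pdf2image.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
        # Assumption: a Debian/Ubuntu-based system (Colab included).
        print("Try: apt-get install -y poppler-utils")
    else:
        print("Poppler command-line tools found.")
```

If the tools are reported missing, installing Poppler through the system package manager and re-running the cell is the usual next step.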

neeewwww commented 1 year ago

It seems a !pip install pdf-info was missing, but now there's a new error:

ImportError                               Traceback (most recent call last)
in <cell line: 1>()
      1 index = VectorstoreIndexCreator(
      2     embedding=HuggingFaceEmbeddings(),
----> 3     text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)).from_loaders(loaders)

4 frames

/usr/local/lib/python3.10/dist-packages/pdfminer/high_level.py in
      6 from typing import Any, BinaryIO, Container, Iterator, Optional, cast
      7
----> 8 from .converter import (
      9     XMLConverter,
     10     HTMLConverter,

ImportError: cannot import name 'HOCRConverter' from 'pdfminer.converter' (/usr/local/lib/python3.10/dist-packages/pdfminer/converter.py)
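
An ImportError like this usually points to a version mismatch: `high_level.py` expects a `HOCRConverter` that the loaded `pdfminer.converter` does not define. That can happen when packages are upgraded in a notebook without restarting the runtime, though that is only a guess for this case. A rough sketch for checking what the current session actually sees follows. `has_attribute` is a hypothetical helper, not a pdfminer API.

```python
import importlib


def has_attribute(module_name, attr):
    """Report whether `module_name` exposes `attr`.

    Returns True or False when the module imports, and None when the
    module itself cannot be imported. Hypothetical helper for illustration.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr)


if __name__ == "__main__":
    result = has_attribute("pdfminer.converter", "HOCRConverter")
    if result is None:
        print("pdfminer.converter could not be imported at all.")
    elif result:
        print("HOCRConverter is available.")
    else:
        print("HOCRConverter is missing; the installed pdfminer files may be out of sync.")
```

If the attribute is missing, reinstalling `pdfminer.six` and restarting the runtime before re-running the cells is worth trying.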

estkae commented 10 months ago

Works only with Python 3.9, and better still on WSL/Linux (Ubuntu 22.04).