JayZeeDesign / gpt-data-extraction

PdfiumError: Failed to load document (PDFium: File access error) while uploading the pdf file #1

Open nithinreddyyyyyy opened 1 year ago

nithinreddyyyyyy commented 1 year ago

I tried running the same Python code that you uploaded to this repo. The code is below:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from dotenv import load_dotenv
from pytesseract import image_to_string
from langchain.text_splitter import RecursiveCharacterTextSplitter
from PIL import Image
from io import BytesIO
# import pypdfium2 as pdfium
import pypdfium2 as pdfium
import streamlit as st
import multiprocessing
from tempfile import NamedTemporaryFile
import pandas as pd
import pytesseract
import json
import os
import requests

load_dotenv()

os.environ["OPENAI_API_KEY"] = ""

# 1. Convert PDF file into images via pypdfium2

def convert_pdf_to_images(file_path, scale=300 / 72):
    pdf_file = pdfium.PdfDocument(file_path)

    page_indices = [i for i in range(len(pdf_file))]

    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )

    final_images = []

    for i, image in zip(page_indices, renderer):
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        image_byte_array = image_byte_array.getvalue()
        final_images.append(dict({i: image_byte_array}))

    return final_images

# 2. Extract text from images via pytesseract

def extract_text_from_img(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    for index, image_bytes in enumerate(image_list):
        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        image_content.append(raw_text)

    return "\n".join(image_content)

def extract_content_from_url(url: str):
    images_list = convert_pdf_to_images(url)
    text_with_pytesseract = extract_text_from_img(images_list)

    return text_with_pytesseract

# 3. Extract structured info from text via LLM
def extract_structured_data(content: str, data_points):
    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
    template = """
    You are an expert admin people who will extract core information from documents

    {content}

    Above is the content; please try to extract all data points from the content above 
    and export in a JSON array format:
    {data_points}

    Now please extract details from the content  and export in a JSON array format, 
    return ONLY the JSON array:
    """

    prompt = PromptTemplate(
        input_variables=["content", "data_points"],
        template=template,
    )

    chain = LLMChain(llm=llm, prompt=prompt)

    results = chain.run(content=content, data_points=data_points)

    # text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    # chunks = text_splitter.split_documents(content)
    # results = [chain.run(content=chunk, data_points=data_points) for chunk in chunks]

    return results

# 5. Streamlit app
def main():
    default_data_points = """{
        "invoice_item": "what is the item that charged",
        "Amount": "how much does the invoice item cost in total",
        "Company_name": "company that issued the invoice",
        "invoice_date": "when was the invoice issued",
    }"""

    st.set_page_config(page_title="Doc extraction", page_icon=":bird:")

    st.header("Doc extraction :bird:")

    data_points = st.text_area(
        "Data points", value=default_data_points, height=170)

    uploaded_files = st.file_uploader(
        "upload PDFs", accept_multiple_files=True)

    if uploaded_files is not None and data_points is not None:
        results = []
        for file in uploaded_files:
            with NamedTemporaryFile(dir='.', suffix='.csv') as f:
                f.write(file.getbuffer())
                content = extract_content_from_url(f.name)
                print(content)
                data = extract_structured_data(content, data_points)
                json_data = json.loads(data)
                if isinstance(json_data, list):
                    results.extend(json_data)  # Use extend() for lists
                else:
                    results.append(json_data)  # Wrap the dict in a list

        if len(results) > 0:
            try:
                df = pd.DataFrame(results)
                st.subheader("Results")
                st.data_editor(df)

            except Exception as e:
                st.error(
                    f"An error occurred while creating the DataFrame: {e}")
                st.write(results)  # Print the data to see its content

if __name__ == '__main__':
    multiprocessing.freeze_support()
    main()

But when I upload a PDF file in the Streamlit app, it returns the error below:

PdfiumError: Failed to load document (PDFium: File access error).

Can you please let me know how to fix this error?

AbubakrChan commented 1 year ago

yes please i am having the same issue

wssranjula commented 1 year ago

Having the same issue

AbubakrChan commented 1 year ago

Guys, I ran it on Replit and the issue got resolved, idk why, but the problem is that Replit doesn't support pytesseract.

AbubakrChan commented 1 year ago

I am trying some alternatives, I'll let you know if it's solved.

AbubakrChan commented 1 year ago

till then please continue doing your research

nithinreddyyyyyy commented 1 year ago

It works locally: I changed the code from Streamlit to plain Python and ran it. It runs, so I'm unsure what the issue with Streamlit is.

wssranjula commented 1 year ago

It's not working for me. Did you install something called tesseract?

nithinreddyyyyyy commented 1 year ago

Yes. Install Tesseract and add tesseract.exe to your PATH (system environment variables). Then install pytesseract with pip or conda, and point pytesseract at tesseract.exe after the imports in your Python code, as in the example below:

pytesseract.pytesseract.tesseract_cmd = r'C:\Users\USER\AppData\Local\Tesseract-OCR\tesseract.exe'
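
To avoid hard-coding one machine's path, the lookup could be sketched roughly like this. This is only an illustration: `find_tesseract` and its parameters are hypothetical names, not part of pytesseract or this repo, and it assumes the binary is either on PATH or at one of the candidate locations you pass in.

```python
import os
import shutil
from typing import Iterable, Optional


def find_tesseract(candidates: Iterable[str] = (), name: str = "tesseract") -> Optional[str]:
    """Return a path to the Tesseract binary, or None if it cannot be found.

    Hypothetical helper: checks PATH first, then the given candidate paths.
    """
    found = shutil.which(name)  # looks the executable up on PATH
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Usage (assumes pytesseract is installed and imported):
# path = find_tesseract([r"C:\Users\USER\AppData\Local\Tesseract-OCR\tesseract.exe"])
# if path:
#     pytesseract.pytesseract.tesseract_cmd = path
```

If the helper returns None, Tesseract is probably not installed, which would be a separate problem from the PdfiumError above.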

AbubakrChan commented 1 year ago

is this running pytesseract for you on replit?

nithinreddyyyyyy commented 1 year ago

I use pycharm, unsure about replit.

gapilongo commented 1 year ago

I have the exact same issue, and my guess is that it comes from Streamlit; the same code works properly with plain Python/Flask.

raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: File access error).
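
One way to check whether the temporary file, rather than Streamlit itself, is the culprit is to test if a still-open NamedTemporaryFile can be opened a second time by name. Python's tempfile documentation notes that this works on Unix but not on Windows when the file is created with default arguments. The sketch below is a diagnostic only; `can_reopen_while_open` is a hypothetical name, and it does not involve pypdfium2 at all.

```python
from tempfile import NamedTemporaryFile


def can_reopen_while_open(data: bytes) -> bool:
    """Return True if a still-open named temp file can be reopened by path.

    Hypothetical diagnostic: mimics what happens when a second reader
    (such as a PDF library) opens f.name inside the `with` block.
    """
    with NamedTemporaryFile(suffix=".pdf") as f:
        f.write(data)
        f.flush()  # make sure the bytes are on disk before reopening
        try:
            with open(f.name, "rb") as again:
                return again.read() == data
        except OSError:
            # e.g. PermissionError when the platform forbids a second open
            return False


if __name__ == "__main__":
    print(can_reopen_while_open(b"%PDF-1.4 dummy"))
```

If this prints False on your machine, a "File access error" from any library that opens the path inside the `with` block would be expected.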

simonhendricks commented 1 year ago

I'm somewhat of a coding noob, but I think the problem is that the temp file(s) that are created (and the path to them) are removed when the 'with NamedTemporaryFile(dir='.', suffix='.csv') as f:' block is exited.

If I force the uploaded files to persist, e.g. 'with NamedTemporaryFile(dir='.', suffix='.csv', delete=False) as f:', then the path passed to Pdfium is valid and I get no errors. I'm sure there's a more elegant solution though, as you'd need to handle the removal of the temp files.
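
The persist-then-clean-up idea could be sketched as follows. This is a minimal sketch, not the repo author's fix: `with_persisted_upload` and `handler` are hypothetical names. It writes the bytes with `delete=False`, closes the file so that other code can open the path, runs the handler, and removes the file afterwards even if the handler raises.

```python
import os
from tempfile import NamedTemporaryFile
from typing import Callable, TypeVar

T = TypeVar("T")


def with_persisted_upload(data: bytes, handler: Callable[[str], T], suffix: str = ".pdf") -> T:
    """Write `data` to a named temp file, close it, call handler(path), then delete it.

    Hypothetical helper; `handler` is any function that takes a file path.
    """
    tmp = NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # flushes to disk and releases the handle
        return handler(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)  # manual cleanup, since delete=False
```

In the Streamlit loop from the original post, this might be called as `content = with_persisted_upload(file.getvalue(), extract_content_from_url)`, assuming the uploaded file object exposes `getvalue()` like a `BytesIO`.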