Tinche / aiofiles

File support for asyncio
Apache License 2.0

It doesn't work to read pdf files? #163

Closed: summarizepaper closed this issue 1 year ago

summarizepaper commented 1 year ago

Hello, I'm really struggling to read my pdf files asynchronously with aiofiles. I want to extract the text from pdfs.

The routine that works is:

with open(pdf_filename, 'rb') as file:
    resource_manager = PDFResourceManager(caching=False)

    # Create a string buffer object for text extraction
    text_io = StringIO()

    # Create a text converter object
    text_converter = TextConverter(resource_manager, text_io, laparams=LAParams())

    # Create a PDF page interpreter object
    page_interpreter = PDFPageInterpreter(resource_manager, text_converter)

    # Process each page in the PDF file
    async for page in extract_pages(file):
        page_interpreter.process_page(page)

    text = text_io.getvalue()

But if I replace "with open(pdf_filename, 'rb') as file" with "async with aiofiles.open(pdf_filename, 'rb') as file", the line "async for page in extract_pages(file)" fails with:

    async for page in extract_pages(file):
TypeError: 'async for' requires an object with __aiter__ method, got generator

So how do I get the file returned by aiofiles to behave like a normal file with __aiter__?

Many thanks if you can tell me what is going on.

Tinche commented 1 year ago

Looks like the extract_pages function doesn't support async files.
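
For what it's worth, the "got generator" part of the error suggests that extract_pages returns an ordinary (synchronous) generator, and async for only accepts objects that implement __aiter__. Here is a minimal sketch that reproduces the same TypeError with the standard library alone. sync_pages is a made-up stand-in for extract_pages, not the real function:

```python
import asyncio


def sync_pages():
    # Hypothetical stand-in for a synchronous generator such as the one
    # extract_pages appears to return (per the "got generator" error).
    yield "page 1"
    yield "page 2"


async def iterate_with_async_for():
    # 'async for' looks up __aiter__, which a plain generator lacks,
    # so this raises TypeError at runtime.
    try:
        async for page in sync_pages():
            pass
    except TypeError as exc:
        return str(exc)
    return None


async def iterate_with_for():
    # A plain 'for' loop is valid inside a coroutine and works as usual.
    return [page for page in sync_pages()]


error_message = asyncio.run(iterate_with_async_for())
pages = asyncio.run(iterate_with_for())
print(error_message)
print(pages)
```

In this sketch the error does not depend on what kind of file object is involved; it only depends on what async for is asked to iterate over.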

summarizepaper commented 1 year ago

How do we then extract text from a pdf with aiofiles?

Tinche commented 1 year ago

I don't know, that's somewhat out of scope for aiofiles. You might want to try asking the authors of your pdf library though ;)
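
One pattern that is sometimes used when a library only accepts synchronous file objects: read the raw bytes without blocking the event loop, wrap them in io.BytesIO, and run the synchronous parser in a worker thread. The sketch below is only an illustration under stated assumptions, not a recommendation from either library. parse_file_async and fake_parser are hypothetical names, and fake_parser stands in for whatever synchronous PDF routine you actually use. asyncio.to_thread needs Python 3.9 or newer.

```python
import asyncio
import io
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable


async def parse_file_async(path: str, parse: Callable[[BinaryIO], str]) -> str:
    # Read the bytes in a worker thread so the event loop is not blocked.
    # With aiofiles, the equivalent step would be:
    #     async with aiofiles.open(path, "rb") as f:
    #         data = await f.read()
    data = await asyncio.to_thread(Path(path).read_bytes)

    # The synchronous parser receives an ordinary in-memory binary file.
    # It also runs in a worker thread, since parsing can be slow.
    return await asyncio.to_thread(parse, io.BytesIO(data))


def fake_parser(stream: BinaryIO) -> str:
    # Hypothetical stand-in for a synchronous PDF text extractor.
    return stream.read().decode("utf-8").upper()


def demo() -> str:
    # Write a small sample file, then parse it through the async wrapper.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.bin"
        sample.write_bytes(b"hello pdf")
        return asyncio.run(parse_file_async(str(sample), fake_parser))


result = demo()
print(result)
```

The trade-off is that the whole file is held in memory, which is usually fine for typical PDFs but may not suit very large ones.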