chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Extract PDF on a per-page basis #191

Closed luhgit closed 6 years ago

luhgit commented 6 years ago

Hi, does python-tika support extracting a PDF on a per-page basis? I want to deconstruct a big PDF into separate pages and extract them separately. Can this be done with python-tika? In the native version, at least, they say one can do it by counting

....

tags, since Tika returns each page wrapped in these tags. I was wondering whether python-tika has support for that?

Thanks.

chrismattmann commented 6 years ago

Hi @luhgit, we typically do this by defining some "split" for the page and then just calling parsed = parser.from_file(...) and then pages = parsed["content"].split(SOME_DELIM). Would that work?
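
For example, a minimal sketch of that split approach (the file name and delimiter below are placeholders; you would need to pick whatever string actually separates pages in your document):

from tika import parser

# Placeholder delimiter: pick a string that actually marks page breaks in your PDF.
SOME_DELIM = "\n\n\n\n"

parsed = parser.from_file("my_document.pdf")  # placeholder file name
pages = parsed["content"].split(SOME_DELIM)
print("Found {} chunks".format(len(pages)))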

luke4u commented 5 years ago

@chrismattmann , hope this message finds you well.

Just want to get a better understanding of what you said about 'split'. To split out each PDF page, what delimiter are you referring to? The text the parser returns does not come with a defined delimiter between pages.

For instance, in the text returned by the parser below, everything after the bold text (31 DECEMBER 2018) is the 2nd page. I am not sure whether \n\n\n\n is the page break used by Tika by default? I checked the rest of the text, and \n\n\n\n seems to appear at the end of each page!

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAhli United Bank B.S.C.\n\nCONSOLIDATED FINANCIAL STATEMENTS\n\n31 DECEMBER 2018\n\n\n\nAhli United Bank B.S.C.\n\nCONTENTS OF THE CONSOLIDATED FINANCIAL STATEMENTS\n\nIndependent auditors' report to the shareholders of Ahli United Bank B.S.C………………………………..

Thanks so much. Luke

chrismattmann commented 5 years ago

Thanks @luke4u. I think one way you could do it is to just use pdftk to split the PDF into individual pages, and then run Tika-Python on each page after that?
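
For example, a rough sketch of that approach, assuming pdftk is installed and on the PATH (the input file name and output pattern below are placeholders):

import glob
import subprocess

from tika import parser

# Split the PDF into one file per page using pdftk's "burst" operation.
subprocess.run(["pdftk", "input.pdf", "burst", "output", "page_%03d.pdf"], check=True)

# Run Tika on each single-page PDF.
pages = []
for page_file in sorted(glob.glob("page_*.pdf")):
    parsed = parser.from_file(page_file)
    pages.append(parsed["content"])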

luke4u commented 5 years ago

Thanks @chrismattmann. I was trying to find a delimiter within the text parsed by Tika. Now I know there is no such delimiter. I have decided to do what you suggested: split the PDF and then use Tika.

luhgit commented 5 years ago

Hi, I used tika-python to split the PDF as well. You can set the xmlContent parameter to True in the parser.from_file() function and then parse the returned XHTML with Beautiful Soup. Each page's contents are wrapped in a <div class="page"> element. See the code below that I wrote for my parser, which splits a PDF and extracts the contents of each page.

from io import StringIO

from bs4 import BeautifulSoup
from tika import parser

file_data = []
_buffer = StringIO()
# Ask Tika for XHTML output so page boundaries are preserved as <div class="page"> elements.
data = parser.from_file(file_path, xmlContent=True)
xhtml_data = BeautifulSoup(data['content'], 'html.parser')
for page, content in enumerate(xhtml_data.find_all('div', attrs={'class': 'page'})):
    print('Parsing page {} of pdf file...'.format(page + 1))
    # Re-parse the single page's XHTML through Tika to get plain text back.
    _buffer.write(str(content))
    parsed_content = parser.from_buffer(_buffer.getvalue())
    _buffer.truncate()
    file_data.append({'id': 'page_' + str(page + 1), 'content': parsed_content['content']})

I hope this helps. Let me know if you have questions.

luke4u commented 5 years ago

Thanks @luhgit, this is really helpful! It is more efficient than splitting the PDF.

letnotimitateothers commented 4 years ago

Thanks a lot @luhgit! That really helps me too. I will try to do it on a doc file now.

salvacarrion commented 4 years ago

Hi,

Thank you a lot @luhgit for the code!

In my case, _buffer.truncate() wasn't truncating the buffer, so the output was being accumulated (e.g. [page1, page1+page2, page1+page2+page3, ...]). To fix it, I simply create a new buffer each time; it's also faster.

from io import StringIO

from bs4 import BeautifulSoup
from tika import parser


def parse_pdf_pages(filename):
    pages_txt = []

    # Read PDF file
    data = parser.from_file(filename, xmlContent=True)
    xhtml_data = BeautifulSoup(data['content'], 'html.parser')
    for i, content in enumerate(xhtml_data.find_all('div', attrs={'class': 'page'})):
        # Parse PDF data using TIKA (xml/html)
        # It's faster and safer to create a new buffer than truncating it
        # https://stackoverflow.com/questions/4330812/how-do-i-clear-a-stringio-object
        _buffer = StringIO()
        _buffer.write(str(content))
        parsed_content = parser.from_buffer(_buffer.getvalue())

        # Add pages
        text = parsed_content['content'].strip()
        pages_txt.append(text)

    return pages_txt

luhgit commented 4 years ago

Hi @salvacarrion, I don't know what the reason could have been for the buffer not being truncated. But I agree that creating a new buffer every time is another way to solve it. I just avoided it so as not to leave too much work for the garbage collector :) But I am happy to hear that it helped you.

mchari commented 4 years ago

Thanks for the code. It works great for me. I need to process PDF files on a page-by-page basis. I started off using pypdf2 as it returns text as a concatenation of pages, but I got burned by pypdf2: it was unable to convert several pages and documents to text. For the PDF files that I have, Tika has been more robust when converting PDF to text. The above solution takes care of both my needs: robust conversion to text and the ability to process individual pages. Thank you so much!

chrismattmann commented 4 years ago

This is great work @mchari @salvacarrion and @luhgit! Can someone write me a unit test using that function? I'll add a new module pdf.py in the tika/ folder, and keep that as a key function...
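
For instance, a minimal sketch of what such a test could look like, assuming the helper from this thread is exposed as parse_pdf_pages in a new tika/pdf.py module and that a small three-page sample PDF ships with the test suite (the module path, function name, and fixture file are assumptions, not part of the current API):

import unittest

# Assumption: the page-splitting helper from this thread is added to tika/pdf.py as parse_pdf_pages.
from tika.pdf import parse_pdf_pages


class PdfPageSplitTest(unittest.TestCase):
    def test_returns_one_string_per_page(self):
        # Assumption: a small three-page sample PDF shipped with the test suite.
        pages = parse_pdf_pages('tests/files/three_pages.pdf')
        self.assertEqual(len(pages), 3)
        self.assertTrue(all(isinstance(page, str) for page in pages))


if __name__ == '__main__':
    unittest.main()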

emmalaud commented 4 years ago

@salvacarrion @luhgit hi, I am having the same text accumulation problem where the buffer isn't being truncated, so I tried creating a new buffer each time as suggested, but that leaves me with parsed_content['content'] being None. Any idea how to fix this?

emmalaud commented 4 years ago

Update: _buffer.seek(0) is required before _buffer.truncate(); this fixed my truncation issue!
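
For reference, a minimal sketch of that fix applied to the per-page loop from the earlier snippet (variable names taken from that snippet):

# Inside the per-page loop:
_buffer.write(str(content))
parsed_content = parser.from_buffer(_buffer.getvalue())
_buffer.seek(0)     # rewind to the start of the buffer first...
_buffer.truncate()  # ...so truncate() actually discards the previous page's content
file_data.append({'id': 'page_' + str(page + 1), 'content': parsed_content['content']})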

chrismattmann commented 4 years ago

Thanks @emmalaud. If someone gets a chance, I'd appreciate someone putting together a unit test?