Hi @luhgit we typically do this by defining some "split" delimiter for the page and then just calling `parsed = parser.from_file(filename)` followed by `pages = parsed["content"].split(SOME_DELIM)`. Would that work?
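A rough sketch of that idea; `SOME_DELIM` is a placeholder, since the actual page separator depends on how Tika renders the document (luke4u reports seeing `\n\n\n\n` below):

```python
from tika import parser

# Assumption only: inspect your own parser output to find the real separator.
SOME_DELIM = "\n\n\n\n"

parsed = parser.from_file("document.pdf")
pages = parsed["content"].split(SOME_DELIM)
print("Got {} chunks".format(len(pages)))
```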
@chrismattmann, hope this message finds you well.
Just want to get a better understanding of what you said about 'split'. To split out each PDF page, what delimiter are you referring to? The text the parser returns does not come with a defined delimiter between pages.
For instance, in the text returned by the parser below, everything after the bold text (31 DECEMBER 2018) is the 2nd page. Not sure if \n\n\n\n is the page break used by Tika by default? I checked the rest of the text, and it seems \n\n\n\n appears at the end of each page!
"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAhli United Bank B.S.C.\n\nCONSOLIDATED FINANCIAL STATEMENTS\n\n31 DECEMBER 2018\n\n\n\nAhli United Bank B.S.C.\n\nCONTENTS OF THE CONSOLIDATED FINANCIAL STATEMENTS\n\nIndependent auditors' report to the shareholders of Ahli United Bank B.S.C………………………………..
Thanks so much. Luke
thanks @luke4u. I think one way you could do it is just to use pdftk to split the PDF into individual pages, and then use Tika-Python after that?
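For reference, a rough sketch of that route, assuming pdftk is installed and on the PATH (file names here are illustrative):

```python
import glob
import subprocess

from tika import parser

# Burst the PDF into one file per page using pdftk's "burst" operation,
# then run each page file through Tika individually.
subprocess.run(["pdftk", "input.pdf", "burst", "output", "page_%04d.pdf"],
               check=True)

pages = [parser.from_file(path)["content"]
         for path in sorted(glob.glob("page_*.pdf"))]
```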
Thanks @chrismattmann. I was trying to find a delimiter within the text parsed by Tika; now I know there is no such delimiter. I decided to do what you said: split the PDF first and then use Tika.
Hi,
I used tika-python to split the PDF as well. You can set the xmlContent parameter to True in the parser.from_file() function and then parse the returned XHTML with Beautiful Soup. Each page's contents are wrapped in a `<div>` element with the class "page", i.e. `<div class="page">`. See the code below, which I wrote for my parser; it splits a PDF and extracts the contents page by page.
```python
from io import StringIO

from bs4 import BeautifulSoup
from tika import parser

file_data = []
_buffer = StringIO()

# Ask Tika for XHTML so that page boundaries are preserved as <div class="page">.
data = parser.from_file(file_path, xmlContent=True)
xhtml_data = BeautifulSoup(data['content'])

for page, content in enumerate(xhtml_data.find_all('div', attrs={'class': 'page'})):
    print('Parsing page {} of pdf file...'.format(page + 1))
    # Re-parse each page's XHTML fragment through Tika to get plain text.
    _buffer.write(str(content))
    parsed_content = parser.from_buffer(_buffer.getvalue())
    _buffer.truncate()
    file_data.append({'id': 'page_' + str(page + 1),
                      'content': parsed_content['content']})
```
I hope this helps. Let me know if you have questions.
Thanks @luhgit, this is really helpful! Much more efficient than splitting the PDF.
thanks a lot @luhgit! That really helps me too. Will try to do it on a doc file now.
Hi,
Thank you a lot @luhgit for the code!
In my case, `_buffer.truncate()` wasn't truncating the buffer, so the output accumulated (e.g. [page1, page1+page2, page1+page2+page3, ...]). To fix it, I simply create a new buffer each time; it's faster (see the function below).
```python
from io import StringIO

from bs4 import BeautifulSoup
from tika import parser


def extract_pages(filename):
    pages_txt = []

    # Read the PDF file as XHTML so pages arrive as <div class="page"> elements
    data = parser.from_file(filename, xmlContent=True)
    xhtml_data = BeautifulSoup(data['content'])

    for i, content in enumerate(xhtml_data.find_all('div', attrs={'class': 'page'})):
        # Parse each page's XHTML fragment through Tika.
        # It's faster and safer to create a new buffer than truncating it:
        # https://stackoverflow.com/questions/4330812/how-do-i-clear-a-stringio-object
        _buffer = StringIO()
        _buffer.write(str(content))
        parsed_content = parser.from_buffer(_buffer.getvalue())

        # Add the page's plain text
        text = parsed_content['content'].strip()
        pages_txt.append(text)

    return pages_txt
```
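Called on a file, it returns one string per page (`extract_pages` is just the name given to the snippet above; the file name is illustrative):

```python
pages = extract_pages("report.pdf")
print("Extracted {} pages".format(len(pages)))
```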
Hi @salvacarrion, I don't know what the reason could have been for the buffer not being truncated, but I agree that creating a new buffer every time is another way to solve it. I only avoided it so as not to leave too much work for the garbage collector :) I am happy to hear that it helped you.
Thanks for the code. It works great for me. I need to process PDF files on a page-by-page basis. I started off using pypdf2, as it returns text as a concatenation of pages, but I got burned by pypdf2: it was unable to convert several pages and documents to text. For the PDF files I have, Tika has been more robust at converting PDF to text. The above solution takes care of both my needs: robust conversion to text and the ability to process individual pages. Thank you so much!
this is great work @mchari @salvacarrion and @luhgit! Can someone write me a unit test using that function? I'll add a new module pdf.py in the tika/ folder and keep that as a key function...
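A possible starting point for such a test; this is only a sketch, assuming the page-splitting function above lands in tika/pdf.py as extract_pages(filename) and that a small two-page fixture PDF ships under tests/files/ (both names are assumptions):

```python
import unittest

from tika.pdf import extract_pages  # hypothetical module/function, per the proposal above


class TestExtractPages(unittest.TestCase):
    def test_pages_are_split_without_accumulation(self):
        # "tests/files/two_pages.pdf" is an assumed two-page fixture document.
        pages = extract_pages("tests/files/two_pages.pdf")
        self.assertEqual(len(pages), 2)
        # Every page should be non-empty text...
        for page in pages:
            self.assertTrue(isinstance(page, str) and page.strip())
        # ...and page 2 must not contain page 1 (the accumulation bug above).
        self.assertNotIn(pages[0], pages[1])


if __name__ == '__main__':
    unittest.main()
```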
@salvacarrion @luhgit hi, I am having the same text accumulation problem where the buffer isn't being truncated, so I tried creating a new buffer each time as suggested, but that leaves my parsed_content['content'] as None. Any idea how to fix this?
Update: `_buffer.seek(0)` is required before `_buffer.truncate()`; this fixed my truncation issue!
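For anyone hitting the same thing, a minimal standalone illustration of why the rewind is needed:

```python
from io import StringIO

_buffer = StringIO()
_buffer.write("page 1 text")

# truncate() only discards data *after* the current position, and after a
# write the cursor sits at the end, so truncate() alone removes nothing.
_buffer.seek(0)     # rewind first...
_buffer.truncate()  # ...then truncating really empties the buffer
_buffer.write("page 2 text")

assert _buffer.getvalue() == "page 2 text"
```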
thanks @emmalaud! If someone gets a chance, I'd appreciate it if they could put together a unit test?
Hi, do we have support in tika-python to extract a PDF at the page level? I want to deconstruct a big PDF into separate pages and extract them separately. Could it be done using tika-python? In the native version, at least, they say one can do it by counting...
Thanks.