from tika import parser
text = parser.from_file(filepath, service='text')['content']
I have some files in doc/docx/pdf format and I want to extract as much detail from them as possible.
But if a file contains a two-column table, Tika merges text that falls on the same line, even though the text belongs to two separate columns.
For doc/docx files, if there is a text box or table at the top of the document, its text is appended at the end of the extracted text, which is a big problem for me.
Can anyone suggest how to handle these issues?
Note: docx-python/pdfminer/pypdf will not work for me, as I have to use a special URL/path to the file.
Thank you for your question, @mrm202. This sounds like an issue in the upstream Apache Tika server library. Please ask your question on dev@tika.apache.org. Thanks. cc @tballison
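In the meantime, one thing that might be worth trying is asking Tika for XHTML instead of plain text, so that any table markup it emits can be read cell by cell rather than as merged lines. The sketch below is only an illustration: `xmlContent=True` is an existing option of `parser.from_file` in tika-python, but `TableCellCollector` and `extract_xhtml` are hypothetical helper names, and whether Tika actually emits `<table>` markup depends on the file type (it is more likely for doc/docx than for pdf). It does not address the text-box ordering problem.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Hypothetical helper: keep table cells apart instead of merging them."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per table row
        self.body_text = []   # text found in the body but outside table cells
        self._in_body = False
        self._row = None
        self._cell = None     # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Normalise whitespace inside the cell before storing it.
            text = " ".join("".join(self._cell).split())
            if self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
        elif self._in_body and data.strip():
            self.body_text.append(data.strip())


def extract_xhtml(filepath):
    """Hypothetical wrapper: return Tika's XHTML output for a file."""
    # Imported lazily so the collector can be tried without a Tika server.
    from tika import parser
    parsed = parser.from_file(filepath, xmlContent=True)
    return parsed["content"]  # may be None if Tika extracted nothing


# Hand-written sample standing in for Tika's XHTML output (an assumption,
# not real Tika output).
sample = (
    "<html><head><title>t</title></head><body>"
    "<table><tr><td>Left column</td><td>Right column</td></tr></table>"
    "<p>Body paragraph</p></body></html>"
)
collector = TableCellCollector()
collector.feed(sample)
collector.close()
print(collector.rows)
print(collector.body_text)
```

With a real file, the same collector would be fed `extract_xhtml(filepath)` instead of `sample`. Nested tables are not handled in this sketch.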
The code I am using is the one shown at the top of this post.