Tika-python is not extracting text properly?

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Apache License 2.0

1.51k stars, 236 forks

The code I am using is the following:

from tika import parser
text = parser.from_file(filepath, service='text')['content']

I have some files in doc/docx/pdf format and I want to extract as much detail as possible. However, if a file contains a two-column table, Tika merges the text that falls on the same line, even though that text belongs to two separate columns.
For doc/docx files, if there is a text box or table at the top of the document, its text is appended to the end of the extracted text, which is a major issue for me. Can anyone suggest how to handle this? Note: docx-python/pdfminer/pypdf will not work for me, as I have to use a special URL/path to the file.
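One thing that may be worth trying for the table case: `parser.from_file` accepts `xmlContent=True`, which returns XHTML instead of plain text. For doc/docx input, Tika typically emits tables as `<table>`/`<tr>`/`<td>` elements, so the cells can be kept apart instead of being merged into one line. The sketch below is only an illustration under that assumption. `TableRowCollector`, `table_rows`, and `extract_xhtml` are hypothetical helper names, not part of tika-python. PDFs generally carry no table structure, so this is unlikely to help with two-column PDFs, and it does not change where text-box content appears in the output.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the cell text of every <tr> in an XHTML string, one list per row.

    Hypothetical helper; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(xhtml):
    """Return table rows found in an XHTML string as lists of cell strings."""
    collector = TableRowCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.rows


def extract_xhtml(filepath):
    """Ask Tika for XHTML output instead of plain text."""
    # Imported lazily so table_rows() can be used without tika installed
    from tika import parser
    return parser.from_file(filepath, xmlContent=True)["content"]
```

With a running Tika server, `table_rows(extract_xhtml(filepath))` would then give one list of cell strings per table row, which can be processed column by column.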
