chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Tika-python is not extracting texts properly? #383

Closed by mrm202 1 year ago

mrm202 commented 1 year ago

The code I am using is the following:

from tika import parser
text = parser.from_file(filepath, service='text')['content']
  1. I have some files in doc/docx/pdf format and I want to extract as much detail as possible. But if a file contains a two-column table, Tika merges the text that falls on the same line, even though that text belongs to two separate columns.
  2. For doc/docx files, if there is a text box or table at the top of the document, the text inside it is appended to the end of the extracted text, which is a very big issue for me. Can anyone suggest how to handle this? Note: docx-python/pdfminer/pypdf will not work for me, as I have to use a special URL/path-to-file.

chrismattmann commented 1 year ago

Thank you for your question, @mrm202. This sounds like an issue in the upstream Apache Tika server library. Please ask your question on dev@tika.apache.org. Thanks. cc @tballison