chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

PDF Text extraction: Date superscript split into separate lines #373

Closed: teohsinyee closed this issue 1 year ago

teohsinyee commented 2 years ago

Screenshot of the original text in my PDF: [image]

Result of the extraction: [image]

Is there a setting to extract the line exactly as "2nd of March 2015 onwards", rather than splitting it into 3 lines? Any help is very much appreciated!

chrismattmann commented 1 year ago

I think you can do something about this in the upstream Tika server library. Please ask on dev@tika.apache.org. cc @tballison. Thanks, @teohsinyee.
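
Until there is an answer upstream, one possible workaround is to post-process the extracted text on the Python side. The following is only a minimal sketch, not a Tika setting. The screenshots are not available here, so it assumes the three lines come out as the day number, then the superscript suffix, then the rest of the sentence (for example "2", "nd", "of March 2015 onwards"). The helper name `merge_ordinal_lines` is hypothetical and not part of tika-python.

```python
import re

# Assumption: the superscript suffix lands on its own line between the day
# number and the rest of the sentence, e.g. "2\nnd\nof March 2015 onwards".
# If the extracted layout differs, this pattern has to be adapted.
_ORDINAL_SPLIT = re.compile(
    r"(\d+)[ \t]*\n[ \t]*(st|nd|rd|th)[ \t]*\n[ \t]*(?=\S)"
)


def merge_ordinal_lines(text: str) -> str:
    """Rejoin a day number, its ordinal suffix, and the text that follows."""
    # "\1\2 " glues the number and suffix together and puts a single space
    # before the continuation of the sentence.
    return _ORDINAL_SPLIT.sub(r"\1\2 ", text)


sample = "2\nnd\nof March 2015 onwards"
print(merge_ordinal_lines(sample))
```

With tika-python, the function could be applied to the `"content"` value of the dictionary returned by `parser.from_file(...)`. Text that does not match the assumed three-line pattern is returned unchanged.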