chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Is there an API we can call to return the extracted text in utf-8 encoding #236

Closed: lathakris closed this issue 4 years ago

lathakris commented 5 years ago

Is there an API we can call to return the extracted text in UTF-8 encoding?

I am using tika-python version 1.15.

chrismattmann commented 4 years ago

Once you get the extracted text in Python, just follow this guide for Python 3.x to turn the text into UTF-8.
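A minimal sketch of that step, assuming Python 3 and that the text comes from tika-python's `parser.from_file`, which returns a dict with a `"content"` key. The helper names `write_utf8` and `extract_to_utf8` are made up for illustration and are not part of the library. In Python 3 the extracted text is already a Unicode `str`, so the encoding only comes into play when the text is written out as bytes:

```python
from pathlib import Path


def write_utf8(text, out_path):
    """Write a Python 3 str to out_path encoded as UTF-8."""
    # Python 3 strings are Unicode; the encoding is chosen when writing bytes.
    Path(out_path).write_text(text, encoding="utf-8")


def extract_to_utf8(pdf_path, out_path):
    """Extract text with tika-python and save it as UTF-8 (hypothetical helper)."""
    # Imported here so write_utf8 can be used without tika installed.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # "content" can be None when Tika extracts no text from the file.
    write_utf8(parsed.get("content") or "", out_path)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default, which on Windows is often not UTF-8.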

abubelinha commented 3 years ago

@chrismattmann I followed the link you mentioned, but it doesn't clarify things for me.

It says UTF-8 is the default when using Python 3. But my question is: is there any default for how the text in the PDF files themselves is encoded? Do our scripts need to check something in the PDF metadata before extracting the text and writing it to a local text file?

I am trying to extract text from some PDFs, mainly in French, Portuguese and Spanish. My script produces correctly formatted UTF-8 text files most of the time (i.e., the accented words look as they do in the PDF).

I made a list of files and a loop that calls tika on each one. But it fails for some of the PDFs in the loop, and I can't tell why.

If I open one of the incorrect text files in Notepad++, the accent marks are never over the letters (as they were in the PDF). Sometimes they appear after the letter, and sometimes before it.
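One possible way to narrow this down is to look at which Unicode characters the output actually contains. This is only a diagnostic sketch using the standard `unicodedata` module, and the function names are made up for illustration. It assumes the misplaced accents are either combining marks (category `Mn`, stored after the base letter) or standalone spacing accents (category `Sk`). NFC normalization merges a letter followed by a combining mark into one precomposed character, but it leaves standalone spacing accents untouched, so it would only help in the first case:

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, Unicode name, category) for each non-ASCII character."""
    rows = []
    for ch in text:
        if ord(ch) > 127:
            rows.append(
                (
                    ch,
                    f"U+{ord(ch):04X}",
                    unicodedata.name(ch, "UNKNOWN"),
                    unicodedata.category(ch),
                )
            )
    return rows


def compose_accents(text):
    """Merge base letter + combining mark pairs into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)
```

Running `describe_non_ascii` on a short slice of a bad output file shows whether the accents are combining marks or separate spacing characters, which in turn indicates whether `compose_accents` can be expected to help.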

Could this be related to differences in the way the PDFs were produced?

If I look at the properties of the failing PDF file, I see it was created with the application "LaTeX with hyperref", and the PDF Producer field says pdfTeX-1.40.19.

Many thanks in advance for your help