chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Parsed text for EPUB mixes in metadata strings by default, and contains image tags + alt-text if service parameter is set to text #389

Closed bitsgalore closed 1 year ago

bitsgalore commented 1 year ago

While doing some tests with tika-python on text extraction from EPUB files I came across some unexpected behaviour. Take the example file below:

https://www.dbnl.org/tekst/berk011veel01_01/ebook/berk011veel01_01.epub

I wrote a small Python script to extract unformatted text from this EPUB. Following the example in the readme, I initially used this:

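from tika import parser

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

# Parse with the default service; the result is a dict
parsed = parser.from_file(fileIn)
content = parsed["content"]

# Write the extracted text to file
with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)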

This works, but I noticed that the extracted text was followed by some odd text strings that correspond to names of fonts that are embedded in the file:

Charis SIL Bold Italic

::
::

Charis SIL Small Caps

As I didn't expect this, I tried extracting the text with the Java JAR like this:

java -jar ~/tika/tika-app-2.6.0.jar -t berk011veel01_01.epub  > berk011veel01_01_alt.txt

In this case the font names don't show up in the result. I then took a look at the tika-python source code, and noticed the "service" parameter of the "from_file" parser function:

https://github.com/chrismattmann/tika-python/blob/master/tika/parser.py#L23

So I changed the call in my code to this:

parsed = parser.from_file(fileIn, service='text')

After this change the font names were not reported anymore (but there are some other subtle changes, like the inclusion of image placeholders like "[image: cover]").

Looking at the code in more detail, it seems that depending on whether "text" is specified as the "service" value or not, the function uses a completely different method to construct the "content" string. This is a bit odd, as the only difference I would expect here is a "metadata" value of "None", with no further effect on "content".

My suggestion would be to make the behavior that now occurs when "service" is set to "text" the default (which is also consistent with the default behavior of the Java CLI app). This would still leave the alt-text descriptions in the output, though, and these do not appear when tika-app is called directly.
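To make the difference between the two code paths concrete, the results can be compared line by line. This is only a sketch: compare_services is a hypothetical helper, not part of tika-python. It takes the parse function as an argument so it can be tried without a running Tika server, and it assumes the returned dict has a "content" key, as in the calls above.

```python
def compare_services(path, from_file):
    """Parse `path` with the default service and with service='text',
    and return the non-empty lines that occur in only one of the results."""
    # "content" can be None when nothing was extracted, so fall back to "".
    default_content = from_file(path)["content"] or ""
    text_content = from_file(path, service="text")["content"] or ""

    default_lines = {ln.strip() for ln in default_content.splitlines() if ln.strip()}
    text_lines = {ln.strip() for ln in text_content.splitlines() if ln.strip()}

    return {
        "only_in_default": sorted(default_lines - text_lines),
        "only_in_text": sorted(text_lines - default_lines),
    }

# Usage with tika-python (needs a reachable Tika server):
#   from tika import parser
#   diff = compare_services("berk011veel01_01.epub", parser.from_file)
```

For the EPUB above, one would expect the font names to end up under "only_in_default" and the image placeholders under "only_in_text", but that depends on the Tika version in use.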

bitsgalore commented 1 year ago

Small update to this: I tried changing the parser call to:

parsed = parser.from_file(fileIn, xmlContent=True)

When I write the resulting parsed["content"] to file, the output is not well-formed XHTML. Instead, after the closing </html> tag of the element that contains the extracted text, there's another series of html elements that mostly contain "meta" elements. I also see some "title" elements with the names of the embedded fonts, which also showed up in my original issue. It seems that by default the parser function is adding stuff to "content" that really should be in "metadata" instead.
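As a stopgap, the trailing elements can be cut off after the first closing </html> tag. The sketch below uses a hypothetical helper, split_first_html_document, and assumes that the extracted text is always the first html element in the string, which is what I observed for this file but may not hold in general.

```python
def split_first_html_document(xml_content):
    """Return (main, remainder): everything up to and including the first
    closing </html> tag, and whatever follows it."""
    closing_tag = "</html>"
    end = xml_content.find(closing_tag)
    if end == -1:
        # No closing tag found: treat the whole string as the main document.
        return xml_content, ""
    cut = end + len(closing_tag)
    return xml_content[:cut], xml_content[cut:]

# Usage with the call above (assumption: parsed["content"] is a string):
#   main, remainder = split_first_html_document(parsed["content"])
```

The remainder is returned rather than discarded, so the "meta" and "title" elements in it can still be inspected.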

bitsgalore commented 1 year ago

OK, just to confirm, using the Tika server JAR directly like this:

curl -T berk011veel01_01.epub  http://localhost:9998/tika  --header "Accept: text/plain" > berk011veel01_01.txt

Output also includes image/alt-text tags like:


[image: cover]

Aster Berkhof

Veel geluk, professor!

[image: DBNL]

BUT these are not included when I use tika-app instead:

java -jar ~/tika/tika-app-2.6.0.jar -t berk011veel01_01.epub  > berk011veel01_01_alt.txt

Not sure if I'm doing something wrong or missing some obvious option here, but it seems there are (at least) 2 separate issues here:

  1. Inclusion of metadata in extraction result (Tika-python issue)
  2. Inconsistent behavior on alt text tags between tika-app and tika-server. Could also be a documentation issue (or perhaps I'm just missing something obvious). Not a Tika-python issue (but unclear behavior on "service" parameter adds to the confusion).

I've created a separate Tika issue for 2:

https://issues.apache.org/jira/browse/TIKA-3969?filter=12326160
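Until the tika-app and tika-server behavior is aligned, the placeholders can be filtered out after extraction. This is a workaround sketch only: strip_image_placeholders is a hypothetical helper, and the "[image: ...]" pattern is taken from the output shown above, so any genuine text line with the same shape would be removed as well.

```python
import re

# A line that consists only of a placeholder such as "[image: cover]",
# with optional surrounding spaces or tabs, plus its line break.
IMAGE_PLACEHOLDER = re.compile(r"^[ \t]*\[image: [^\]\n]*\][ \t]*$\n?", re.MULTILINE)

def strip_image_placeholders(text):
    """Remove lines that contain nothing but an image placeholder."""
    return IMAGE_PLACEHOLDER.sub("", text)
```

Placeholders that share a line with other text are left alone, since removing them there could change the meaning of the sentence.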

chrismattmann commented 1 year ago

As you correctly noted, @bitsgalore, this is an issue in the upstream Tika library. However, yes, the docs in tika-python could be improved on the use of the service parameter. Separately, please submit a PR or make a suggestion for README.md and I'll update it. Thanks for this great exploration.