deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Textract.process returns empty bytes object for EPUBs from DBNL collection #455

Open bitsgalore opened 1 year ago

bitsgalore commented 1 year ago

When I use Textract on EPUBs from the Dutch DBNL site, textract.process results in an empty bytes object, even though other extraction tools (including Ebooklib, which is used by Textract) are able to extract text from these files without problems.

Take as an example the file below:

https://www.dbnl.org/tekst/berk011veel01_01/ebook/berk011veel01_01.epub

Here's some minimal code for extraction:

#! /usr/bin/env python3

import textract

fileIn = "berk011veel01_01.epub"
content = textract.process(fileIn, encoding='utf-8').decode()

print(content)
print(len(content))

Result when running the script:


0

I.e. the content is an empty (zero-length) string. This happened with most of the DBNL books I tried. In some cases just a few words were extracted.

Since Textract uses Ebooklib for EPUB reading, I tried using Ebooklib directly in order to rule out an Ebooklib problem. Below is a minimal test script:

#! /usr/bin/env python3

from html.parser import HTMLParser
import ebooklib
from ebooklib import epub

class HTMLFilter(HTMLParser):
    # Source: https://stackoverflow.com/a/55825140/1209004
    text = ""
    def handle_data(self, data):
        self.text += data

fileIn = "berk011veel01_01.epub"

book = epub.read_epub(fileIn)

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        content = item.get_body_content().decode()
        f = HTMLFilter()
        f.feed(content)
        print(f.text)

Running this script extracts all text without any problems. Text extraction with Tika-python also works as expected. The EPUB file also passes validation with EPUBCheck 4.2.6 without any errors or warnings.
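For anyone who wants to reproduce this without installing Ebooklib or Tika, here is a rough standard-library sketch of a third cross-check. It relies only on the fact that an EPUB is a ZIP archive, and counts the non-whitespace text characters in each HTML/XHTML member. The names `TextCollector` and `count_chars_per_document` are made up for this example, and the approach is an assumption of mine: it is not how Textract or Ebooklib read the file, it ignores the OPF manifest and spine order, and it assumes the content documents are UTF-8 with an `.xhtml`, `.html` or `.htm` extension.

```python
import zipfile
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def count_chars_per_document(epub_path):
    """Return {archive member name: number of non-whitespace text characters}.

    Hypothetical helper: it selects members by file extension only and does
    not consult the OPF manifest, so it may miss or over-include files.
    """
    counts = {}
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            # Assumes UTF-8; undecodable bytes are replaced rather than raising.
            markup = archive.read(name).decode("utf-8", errors="replace")
            parser = TextCollector()
            parser.feed(markup)
            parser.close()
            text = "".join(parser.parts)
            # Count characters after dropping all whitespace.
            counts[name] = len("".join(text.split()))
    return counts
```

Calling `count_chars_per_document("berk011veel01_01.epub")` and printing the resulting dictionary shows which content documents actually carry text. If the counts are non-zero while `textract.process` still returns an empty bytes object, that is a further hint that the text is present in the file and is being lost somewhere in Textract's EPUB handling.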

On a side note, Textract did work for me with some EPUBs I downloaded from Standard Ebooks, such as this one:

https://standardebooks.org/ebooks/robert-louis-stevenson/the-strange-case-of-dr-jekyll-and-mr-hyde/downloads/robert-louis-stevenson_the-strange-case-of-dr-jekyll-and-mr-hyde.epub

Desktop: