When I use Textract on EPUBs from the Dutch DBNL site, textract.process returns an empty bytes object, even though other extraction tools (including Ebooklib, which Textract itself uses) are able to extract text from these files without problems.
Take as an example the file below:
https://www.dbnl.org/tekst/berk011veel01_01/ebook/berk011veel01_01.epub
With this file, the content returned by textract.process is an empty (zero-length) string. This happened with most of the DBNL books I tried. In some cases just a few words were extracted.
Since Textract uses Ebooklib for EPUB reading, I tried using Ebooklib directly in order to rule out an Ebooklib problem. Below is a minimal test script:

#!/usr/bin/env python3
from html.parser import HTMLParser

import ebooklib
from ebooklib import epub


class HTMLFilter(HTMLParser):
    # Source: https://stackoverflow.com/a/55825140/1209004
    # Accumulates the character data of everything fed to the parser.
    text = ""

    def handle_data(self, data):
        self.text += data


fileIn = "berk011veel01_01.epub"
book = epub.read_epub(fileIn)

# Print the plain text of each XHTML document in the EPUB.
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        content = item.get_body_content().decode()
        f = HTMLFilter()
        f.feed(content)
        print(f.text)

Running this script extracts all text without any problems. Text extraction with Tika-python also works as expected. The EPUB file also passes validation with EPUBCheck 4.2.6 without any errors or warnings.
On a side note, Textract did work for me with some EPUBs I downloaded from Standard Ebooks, such as this one:
https://standardebooks.org/ebooks/robert-louis-stevenson/the-strange-case-of-dr-jekyll-and-mr-hyde/downloads/robert-louis-stevenson_the-strange-case-of-dr-jekyll-and-mr-hyde.epub
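A further cross-check that depends on neither Textract nor Ebooklib is possible, because an EPUB is a ZIP archive containing (X)HTML files. The sketch below is not part of the tests described above. It uses only the Python standard library, and the names `TextCollector` and `epub_text` are hypothetical helpers introduced for illustration. It reads archive members in ZIP order rather than following the EPUB spine, so it only indicates whether extractable text is present, not what the correct reading order is.

```python
import io
import zipfile
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects character data, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth of script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def epub_text(source):
    """Return the text of all (X)HTML members of an EPUB, one per line block.

    `source` is a file path or a binary file-like object. Members are
    processed in archive order (an assumption; the spine is ignored).
    """
    chunks = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = TextCollector()
                markup = archive.read(name).decode("utf-8", errors="replace")
                collector.feed(markup)
                collector.close()  # flush any buffered character data
                chunks.append("".join(collector.parts))
    return "\n".join(chunks)


# Demo on a tiny in-memory archive (made-up content, not a DBNL file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("mimetype", "application/epub+zip")
    demo.writestr(
        "OEBPS/ch1.xhtml",
        "<html><body><p>Hello <b>EPUB</b></p></body></html>",
    )
buffer.seek(0)
print(repr(epub_text(buffer)))
```

Calling `epub_text("berk011veel01_01.epub")` on a downloaded copy and checking that the result is non-empty would be one way to confirm that the text is reachable without either library.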
Desktop: