Option to not-read certain media types

aerkalov / ebooklib

Python E-book library for handling books in EPUB2/EPUB3 format

https://ebooklib.readthedocs.io/

GNU Affero General Public License v3.0

1.49k stars, 231 forks

Option to not-read certain media types #247

Open stichiboi opened 2 years ago

stichiboi commented 2 years ago

Hello, I'm trying to read data from epubs I downloaded from the web. I'm only interested in the text; I don't care about images or styles. Would it be possible to add a media_type_filter option and only load the specified types from the manifest?

I imagine something along these lines, in epub.EpubReader._load_manifest:

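media_type = r.get('media-type')
# Skip items whose media type is not in the (optional) filter list.
# continue rather than return: this check sits inside the loop over manifest
# items, so return would stop loading at the first filtered item.
if self.media_type_filter and media_type not in self.media_type_filter:
    continue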

The media_type_filter would just be a list I pass in through the options.
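To make the proposal concrete, here is a minimal standalone sketch of the filtering idea. It does not use or modify ebooklib: `filter_manifest`, `SAMPLE_OPF`, and the `media_type_filter` argument are hypothetical names used only for illustration, and the manifest is parsed directly with the standard library.

```python
import xml.etree.ElementTree as ET

# Namespace used by OPF package documents.
OPF_NS = "http://www.idpf.org/2007/opf"


def filter_manifest(opf_xml, media_type_filter=None):
    """Return (id, href, media_type) tuples for manifest items passing the filter.

    An empty or missing filter keeps every item (hypothetical behaviour,
    mirroring the check proposed above).
    """
    root = ET.fromstring(opf_xml)
    manifest = root.find(f"{{{OPF_NS}}}manifest")
    kept = []
    if manifest is None:
        return kept
    for item in manifest.findall(f"{{{OPF_NS}}}item"):
        media_type = item.get("media-type")
        # Skip this item only; keep looking at the remaining ones.
        if media_type_filter and media_type not in media_type_filter:
            continue
        kept.append((item.get("id"), item.get("href"), media_type))
    return kept


# Made-up manifest for the demonstration.
SAMPLE_OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="ch1" href="text/ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="styles/main.css" media-type="text/css"/>
    <item id="font" href="styles/3.ttf" media-type="application/x-font-ttf"/>
  </manifest>
</package>"""

if __name__ == "__main__":
    for entry in filter_manifest(SAMPLE_OPF, ["application/xhtml+xml"]):
        print(entry)
```

With a filter of `["application/xhtml+xml"]`, only the chapter entry is kept, so the font and stylesheet would never be requested from the archive.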

stichiboi commented 2 years ago

Just to be transparent: this idea comes from an error I keep getting when reading some epubs:

KeyError: "There is no item named 'styles/3.ttf' in the archive"

This error originates from the epub rather than from ebooklib: opening the file with Atom shows that indeed there is no styles/3.ttf (there is a fonts/3.ttf).

I don't want to throw away the whole epub just because ebooklib cannot read the styles, so ideally I could just skip reading them.

This should also make the process quicker.

But I'm no expert in EPUB, so maybe this is not a good idea 😓

aerkalov commented 2 years ago

Good point. Right now everything fails if the EPUB claims to have something that is actually missing from the archive. One option would be a setting for the EpubReader, something like "fail silently". The other would be, as you suggested, a list of things to ignore/allow.