aerkalov / ebooklib

Python E-book library for handling books in EPUB2/EPUB3 format -
https://ebooklib.readthedocs.io/
GNU Affero General Public License v3.0

Reflection #317

Open Onaffair opened 1 month ago

Onaffair commented 1 month ago

Example:

    book = epub.read_epub(epub_file)
    for item in book.items:
        pass

I found that item.get_content() removes the external CSS style sheet references from the XHTML, while item.content preserves them.

aerkalov commented 1 month ago

When the library was created, the idea was that it would be used to produce 100% valid EPUB3 files. At that time most EPUB files were invalid EPUB 2 files, with some EPUB 3 files among them. So the idea was that you would read the input EPUB file into an object and, instead of cleaning the garbage out of that book and making it a valid EPUB3 file, you would create a new book and copy into it only the things you need and know are correct.

That is why you have item.content with the original content, which you can parse with lxml to find the things you need (for instance in the headers), and you also have item.get_content(), which should be used for the books you are creating. That method will always return clean and valid content. That is why for newly created pages you use item.add_item() to add style sheet files or JS files, and don't use header content for it (because it will be ignored). Why? A lot of input files would have .css/.js/.png/.jpeg files located who knows where, and a lot of the content can be invalid. The idea of the library has always been to create very structured and unified EPUB3 files, which would have fonts in the ./Fonts/ directory, images in ./Images/, style sheet files in ./Styles/, and so on, and would always pass epubcheck validation.

This sucks if you just want to read an EPUB file that is 100% valid, change something, and write it back out, but 11 years ago, when the library was first written, that was not really the case.
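To make the "parse the original content and pull out what you need" step concrete, here is a minimal sketch that extracts style sheet references from the head of a raw XHTML document, such as the bytes held in item.content. It is an illustration under assumptions, not the library's own code: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and the helper name find_stylesheet_hrefs and the SAMPLE document are hypothetical.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so tag lookups must include it.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def find_stylesheet_hrefs(raw_xhtml):
    """Return the href of every <link rel="stylesheet"> found in <head>.

    `raw_xhtml` is the unmodified document as bytes (hypothetical helper;
    with ebooklib this would be the value of item.content).
    """
    root = ET.fromstring(raw_xhtml)
    head = root.find(XHTML_NS + "head")
    if head is None:
        return []

    hrefs = []
    for link in head.iter(XHTML_NS + "link"):
        # rel may hold several space-separated tokens, e.g. "alternate stylesheet".
        rel_tokens = link.get("rel", "").lower().split()
        href = link.get("href")
        if "stylesheet" in rel_tokens and href:
            hrefs.append(href)
    return hrefs


# Made-up sample standing in for the original content of one chapter.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Chapter 1</title>
    <link rel="stylesheet" type="text/css" href="../Styles/main.css"/>
  </head>
  <body><p>Hello</p></body>
</html>"""

print(find_stylesheet_hrefs(SAMPLE))  # → ['../Styles/main.css']
```

The returned paths tell you which style sheets the original page relied on, so you can attach the matching items to the page in the new book rather than relying on header markup that get_content() will drop. Real-world EPUB files may contain named entities or malformed markup that ElementTree rejects; a more lenient parser such as lxml may cope better with those.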