aerkalov / ebooklib

Python E-book library for handling books in EPUB2/EPUB3 format -
https://ebooklib.readthedocs.io/
GNU Affero General Public License v3.0

modify epub file, but all the contents inside <head> was lost #221

Open hrhktkbzyy opened 3 years ago

hrhktkbzyy commented 3 years ago

I have posted my question here: https://stackoverflow.com/questions/66061399/modify-epub-file-by-pythons-ebooklib-but-all-the-contents-inside-head-was-lo

I'm using the Python e-book library ebooklib to modify a batch of EPUB files. A simplified version of the code follows.

from ebooklib import epub

# input_path and output_path are the source and destination EPUB paths
book = epub.read_epub(input_path)

page_add = epub.EpubHtml(title='index_add', file_name='index_add.html', lang='en')
page_add.content = u'''
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
    <body>
        <div>
            I'm a new added page
        </div>
    </body>
</html>
'''
book.add_item(page_add)

book.spine.insert(1, page_add)

epub.write_epub(output_path, book, {})

After running the code, a new EPUB file was generated with the new page added. The problem is that all the original pages lost their styles.

As we know, an EPUB file is composed of HTML files. I changed the file extension from .epub to .zip and unzipped it to get all the HTML files. After digging into these files for a while, I found why the styles were lost: the stylesheets were referenced inside the <head> tag of each original HTML file, but the files in the new EPUB lost everything inside <head>. The original <head> looks like this:

<head>
    <link href="../stylesheet.css" rel="stylesheet" type="text/css"/>
    <link href="../page_styles.css" rel="stylesheet" type="text/css"/>
</head>
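
The manual check described above (rename to .zip, unzip, read each file's <head>) can also be scripted. The following is only a sketch using the standard library: the names HeadLinkParser and head_links_by_file are made up for illustration, and it assumes the EPUB's pages end in .html, .xhtml or .htm and are UTF-8 encoded. Running it on the original and on the rewritten EPUB should show which pages lost their stylesheet links.

```python
import zipfile
from html.parser import HTMLParser


class HeadLinkParser(HTMLParser):
    """Collect the attributes of every <link> tag that appears inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link .../> also pass through here.
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            self.links.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def head_links_by_file(epub_file):
    """Map each (X)HTML entry of the archive to the <link> hrefs in its <head>.

    An EPUB is a zip archive, so it can be opened directly without renaming.
    epub_file may be a path or a binary file-like object.
    """
    result = {}
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.endswith((".html", ".xhtml", ".htm")):
                parser = HeadLinkParser()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                result[name] = [link.get("href") for link in parser.links]
    return result
```

For example, `head_links_by_file("original.epub")` returns a dict of file names to href lists; an empty list for a page means its <head> carries no <link> tags.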

In the ebooklib docs, I found the following description:

When defining content you can define it as valid HTML file or just parts of HTML elements you have as a content. It will ignore whatever you have in <head> element.

I think this may be the reason why all the content inside the <head> tag was lost. I don't know why ebooklib does this. Does anyone have a way to fix it? I think my requirement is quite common: just add a page to many existing EPUB files.

Any help will be highly appreciated.

aerkalov commented 3 years ago

I will just copy and paste what I answered on Stack Overflow.

The only proper way to do it with Ebooklib is to read the EPUB file and construct a new EPUB from scratch by cherry-picking what you need from the original file. You were never supposed to read the file, modify it and write it back, because we wanted to always end up with valid EPUB3, and our approach was "I will ignore all the garbage metadata and extra files, just take what I need and keep my layout of the folders".

That being said, that was for the online publishing system we worked on. Using Ebooklib outside of that system, it does make a lot of sense to be able to do something like that. I am not sure at the moment how many changes that would require. Will take a look.

hrhktkbzyy commented 3 years ago

Hi @aerkalov, thank you for your feedback. I think I can iterate over all the stylesheet/JS files and link them into the HTML files again.

Just one question: is there a way to get the content inside the <head> tag from the original HTML files? Apparently, the get_content() method drops everything inside the <head> tag.

aerkalov commented 3 years ago

@hrhktkbzyy you can use .content; it should have the original content.

Here is an example of how it was used to import EPUB files into this publishing system. The idea is that it also rewrites links to chapters and images to fit the new directory structure - https://github.com/booktype/Booktype/blob/master/lib/booktype/importer/epub/epubimporter.py

We had more complex examples where we also changed the CSS files, but they are all hidden somewhere behind locked repositories. Besides fixing this issue, I will look at expanding the docs and maybe adding sample files for this.
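
One way to combine the two ideas above (read the original <head> from .content, then link the stylesheets into the pages again) is sketched below. This is not code from ebooklib or Booktype: head_stylesheet_links and restore_head_links are hypothetical helpers, and the sketch assumes that EpubHtml.add_link(**attrs) records a link which get_content() later writes into the generated <head>, and that only links with type "text/css" are emitted. The functions are duck-typed, so they only need objects with a .content attribute and an add_link method.

```python
from html.parser import HTMLParser


class _HeadLinkCollector(HTMLParser):
    """Record the attributes of <link> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self._in_head = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag == "link" and self._in_head:
            # Drop valueless attributes so the dict can be passed as kwargs.
            self.links.append({k: v for k, v in attrs if v is not None})

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False


def head_stylesheet_links(raw_content):
    """Return the attribute dicts of all stylesheet links in the raw page."""
    if isinstance(raw_content, bytes):
        raw_content = raw_content.decode("utf-8", errors="replace")
    collector = _HeadLinkCollector()
    collector.feed(raw_content)
    return [a for a in collector.links if a.get("rel") == "stylesheet"]


def restore_head_links(items):
    """Re-register each page's original stylesheet links on the item itself.

    Assumption: item.add_link(**attrs) behaves like ebooklib's
    EpubHtml.add_link and the link is written out by get_content().
    """
    for item in items:
        for attrs in head_stylesheet_links(item.content):
            attrs.setdefault("type", "text/css")
            item.add_link(**attrs)
```

With ebooklib this would be called between read_epub and write_epub, for example as `restore_head_links(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))`. The hrefs are copied unchanged, so they stay valid only if the pages and stylesheets keep their relative locations in the written file.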

zalum commented 3 years ago

@aerkalov

When I use epubItem.content I do get the metadata and styles, but when I call epub.write_epub to write the file to disk, it internally uses epubItem.get_content, which strips the metadata and styles. So it does not help. I cannot even set my own styles and metadata, because the write method will discard them.

aerkalov commented 3 years ago

@zalum Yeah, I understand what you mean. Will try to put together a simple version of what we were using and add it to the samples or something.

cryzed commented 3 years ago

Same issue as #76. I also just ran into this simply by wanting to create an e-book. My EpubHtml instances have content which references stylesheets, but those <link> tags are stripped out of the content, wherever I place them. The samples show how to add CSS stylesheets using EpubItem, but apparently there's no way to actually make use of the included stylesheets?

EDIT: Never mind, there is a way: calling add_item() on the EpubHtml instance with the stylesheet item. Still a bit unintuitive, though. I also hope to see this fixed!
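
For readers creating a new book, the add_item() approach mentioned in the edit might look like the sketch below. The file names, uid and CSS are invented for illustration, and build_styled_page is a hypothetical helper. It assumes that EpubHtml.add_item() registers a stylesheet link when given an item whose file name ends in .css, which is how the comment above describes it.

```python
def build_styled_page():
    """Create a book with one page that links to one stylesheet."""
    # ebooklib is a third-party package; it is imported inside the function
    # so that defining this sketch does not require it to be installed.
    from ebooklib import epub

    book = epub.EpubBook()

    # The stylesheet is an ordinary manifest item.
    css = epub.EpubItem(
        uid="style_main",
        file_name="style/main.css",
        media_type="text/css",
        content="div { color: navy; }",
    )
    page = epub.EpubHtml(title="Styled page", file_name="styled.xhtml", lang="en")
    page.content = "<div>I'm a styled page</div>"

    # Both items must be added to the book so they are written to the archive.
    book.add_item(css)
    book.add_item(page)

    # Adding the stylesheet to the page (not the book) is what creates the
    # <link> in the page's generated <head>, instead of a <link> tag placed
    # in page.content, which would be stripped.
    page.add_item(css)
    return book, page
```

The key point is the distinction between book.add_item(css), which puts the file into the EPUB, and page.add_item(css), which links that file from the page.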

0x6f677548 commented 2 years ago

The body id and original styles are also lost. I understand the reason behind this design, but it would be great to be able to use this lib to change some details in existing EPUBs. What about making the behavior optional? I want to propose a new option for write_epub/EpubWriter to change this behavior. We could then use that option to write content instead of get_content() (in _write_items). Something like:

elif item.manifest:
    if self.options.get('html_write_using_document_content'):
        self.out.writestr('%s/%s' % (self.book.FOLDER_NAME, item.file_name), item.content)
    else:
        self.out.writestr('%s/%s' % (self.book.FOLDER_NAME, item.file_name), item.get_content())

That would allow us to copy an e-book without losing the original styles (and eventually change some small details inside it), like this:

from ebooklib import epub

book = epub.read_epub(filename)
# 'html_write_using_document_content' is the option proposed above,
# not an existing ebooklib option
options = {'html_write_using_document_content': True}
epub.write_epub(f'{filename}-copy.epub', book, options)

What do you think?

aerkalov commented 2 years ago

@0x6f677548 I think something like this is a good idea, and a lot of people have asked for it.

When the basic concept and code were created, the library was used only for creating new books and for loading books, 99% of which were failing basic epubcheck. So it made a lot of sense back then.

I do have some code around which clones a book: you just clone the parts you need and work on a new instance of the book. That was a quick solution for it. I rewrote some things but stopped. I will finish cleaning up some PRs from people and return to it.

darkranger-red commented 1 year ago

This issue is still open and has been mentioned recently, so I would like to point out that if modifying the EPUB file is the main purpose of your work, the zipfile module is more suitable than ebooklib.

Maybe something like this:

import zipfile


def modify_xhtml(data: bytes) -> bytes:
    # Placeholder: apply any modification you like and return the new bytes.
    return data


with zipfile.ZipFile("original.epub", "r") as input_archive, \
        zipfile.ZipFile("modified.epub", "w") as output_archive:
    for info in input_archive.infolist():
        content = input_archive.read(info)

        if info.filename.endswith(".xhtml"):
            # Modify the XHTML files; all other file types are copied as-is.
            content = modify_xhtml(content)

        # Passing the original ZipInfo keeps each entry's name, order and
        # compression type (so "mimetype" stays first and uncompressed).
        output_archive.writestr(info, content)

FreeHK-Lunity commented 1 year ago

small peek at that book cloning thingy?

c1924959470 commented 1 year ago

   Hello! I have received your message and will reply as soon as possible.