Sigil-Ebook / Sigil

Sigil is a multi-platform EPUB ebook editor
GNU General Public License v3.0
5.96k stars 578 forks source link

[Bug]: Degradation of HTML tags in Description #719

Closed ketalar closed 1 year ago

ketalar commented 1 year ago

Bug Description

Sorry I'm not a developer and I don't speak English. My text is from an automatic translation. I hope the message reaches you, because I'm having trouble understanding this Github page. When I open Sigil to modify the cover or a typo in the book, the Description metadata which is formatted with HTML tags, once the above-mentioned modifications have been made, is degraded. Example the following metadata as input:

Summary

become after recording:

bSummary/b/p Hope you are helpful. Cordially. ORIGINAL Désolé je ne suis pas développeur et je ne parle pas anglais. Mon texte est issu d'une traduction automatique. J'espère que le message vous parviendra, car j'au du mal à comprendre cette page Github. Lorsque j'ouvre Sigil pour modifier la couverture ou une faute de frappe dans le livre, les métadonnées de Description qui sont formatées avec des balises HTML, une fois les modifications sus-citées faites sont dégradées. Exemple les métadonnées suivantes en entrée :

Résumé

deviennent après enregistrement :

bRésumé/b/p En espérant vous êtes utile. Cordialement. [Sigil_erreur_tags_html.zip](https://github.com/Sigil-Ebook/Sigil/files/12238636/Sigil_erreur_tags_html.zip) ### Platform (OS) Windows (Default) ### OS Version / Specifics 64 bits Professionnel ### What version of Sigil are you using? 1.9.30 ### Any backtraces or crash reports ```shell None/Aucun ```

BeckyDTP commented 1 year ago

I confirm (on built 2.0.0 too).

Everything is shown in the video from @ketalar The error can be triggered faster.

  1. open original.epub
  2. delete cover.jpg
  3. rename logonrf.jpg to cover.jpg
  4. save epub file The description gets corrupted.
dougmassay commented 1 year ago

I'm not quite following. Html tags are not allowed in the metadata description. The opf file is xml. The only way to safely include html tags in the description metadata is to use Sigil's metadata editor (which will safely xml escape any tags that are entered).

If any unescaped tags already incorrectly preexist in an epub, they will certainly be corrupted when Sigil parses the opf in question.

BeckyDTP commented 1 year ago

In the OPF file, the HTML tags are written correctly. <dc:description>&lt;p&gt;Divorcé depuis peu etc.

Metadata Editor allows you to view and edit them.

But this case is different. Correctly saved tags are destroyed under special conditions. It's as if an incomplete (or double-crown) conversion is taking place.

BeckyDTP commented 1 year ago

It seems to be related to this particular file/description, because I can't prepare the minimum file to demonstrate the issue.

dougmassay commented 1 year ago

I see now that the description's html is indeed already xml escaped in the original. Something strange happening for sure.

dougmassay commented 1 year ago

Perhaps something to do with the entire metadata section of the opf being one massive line with no line-breaks?

BeckyDTP commented 1 year ago

I already know.

The &nbsp; entity in the description in the metadata after renaming the image file is written as plain, while it should be written as &amp;nbsp;

Saving the file cuts out the remaining "internal" entities.

dougmassay commented 1 year ago

Put &nbsp; in Preserve Entities.

Nevermind. It seemed to work under certain conditions, but I can't repeat it.

dougmassay commented 1 year ago

It's not the escaped &nbsp; entity either. I've replace all of the escaped non-breaking space entities with normal spaces and the issue persists.

kevinhendricks commented 1 year ago

Could be related to the relatively new OPFParser code that came in a few releases back. Has anyone tested on Sigil 1.3 or something older to see if I introduced this bug when I ported the OPF Parsing code from Python to C++.

I will try to help track this one down when I get to a computer in a few hours.

BeckyDTP commented 1 year ago
  1. open the file from the attachment.
  2. open the OPF file tab. Watch the description in the metadata all the time.
  3. rename cover.jpg to 1.jpg
  4. rename 1.jpg to 2.jpg (sometimes you have to do the renaming a third time). The description is destroyed.

sample-ebook2-epub.zip

kevinhendricks commented 1 year ago

Thanks for the test case. Each time a file is renamed, the entire opf is sanity checked, then parsed, then rewritten. The sanity check can and will introduce changes if it thinks it needs them based on lxml xml parsing.

I will try to track this one down.

kevinhendricks commented 1 year ago

On my machine:

Original description metadata in opf:
<dc:description>&lt;p&gt;&lt;b&gt;Diap&amp;nbsp;Dealer&lt;/b&gt;&lt;/p&gt;</dc:description>

After Step 3:
<dc:description>&lt;p&gt;&lt;b&gt;Diap&nbsp;Dealer&lt;/b&gt;&lt;/p&gt;</dc:description>

(notice the nbsp has not been properly escaped after just this first step)

After Step 4:
<dc:description>&lt;p&gt;&lt;b&gt;DiapDealer/b/p</dc:description>

Now the question is where is the unescaped nbsp coming from.

dougmassay commented 1 year ago

It seems all properly escaped tags/entities after the first offending nbsp is ultimately borked.

dougmassay commented 1 year ago

My guess is that it's not specifically the nbsp entity, but rather that any properly escaped NAMED entity will trigger the issue.

kevinhendricks commented 1 year ago

Agreed.

I have tracked through the OPFParser code and it successfully leaves the &nbsp; alone. The problem actually occurs in the UniversalUpdates routine that happens every time a file is renamed.

This ends up calling out to the embedded Python xmlprocessor.py code:

QString PerformOPFUpdates::operator()() { QString newsource = m_source;

// serialize the hash for passing to python                                                                         
QStringList dictkeys = m_XMLUpdates.keys();
QStringList dictvals;
foreach(QString key, dictkeys) {
    dictvals.append(m_XMLUpdates.value(key));
}

int rv = 0;
QString error_traceback;

QList<QVariant> args;
args.append(QVariant(m_source));
args.append(QVariant(m_newbookpath));
args.append(QVariant(m_CurrentPath));
args.append(QVariant(dictkeys));
args.append(QVariant(dictvals));

EmbeddedPython * epython  = EmbeddedPython::instance();

QVariant res = epython->runInPython( QString("xmlprocessor"),
                                     QString("performOPFSourceUpdates"),
                                     args,
                                     &rv,
                                     error_traceback);

So somehow this code reads in the OPF and messes things up by returning   and not &nbsp;

This is the code that uses sanity check and the lxml processor.

I have started looking there now as everything before that left that dc:description contents untouched.

Kevin

On Aug 2, 2023, at 11:29 AM, Doug Massay @.***> wrote:

My guess is that it's not specifically the nbsp entity, but rather that any properly escaped NAMED entity will trigger the issue. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

kevinhendricks commented 1 year ago

Yes, this is the problem routine:

I printed the OPF source upon entry and the new OPF source upon exit and sure enough, the &amp;nbsp; became &nbsp;

performOPFSourceUpdates input:
<?xml version="1.0" encoding="utf-8"?>
<package version="2.0" unique-identifier="BookId" xmlns="http://www.idpf.org/2007/opf">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:description>&lt;p&gt;&lt;b&gt;Diap&amp;nbsp;Dealer&lt;/b&gt;&lt;/p&gt;</dc:description>
    <dc:language>en</dc:language>
    <dc:title>Sample e-book</dc:title>
    <dc:date opf:event="modification" xmlns:opf="http://www.idpf.org/2007/opf">2023-08-02</dc:date>
    <dc:identifier opf:scheme="UUID" id="BookId">urn:uuid:23fcbb08-a645-4bbf-ac1d-92e37f6182bf</dc:identifier>
    <meta name="cover" content="x1.jpg" />
  </metadata>
  <manifest>
    <item id="Section0001.xhtml" href="t/Section0001.xhtml" media-type="application/xhtml+xml"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="cover.xhtml" href="t/cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="x1.jpg" href="i/1.jpg" media-type="image/jpeg"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="cover.xhtml"/>
    <itemref idref="Section0001.xhtml"/>
  </spine>
  <guide>
    <reference type="cover" title="Cover" href="t/cover.xhtml"/>
  </guide>
</package>

performOPFSourceUpdates output:
<?xml version="1.0" encoding="utf-8" ?>
<package version="2.0" unique-identifier="BookId" xmlns="http://www.idpf.org/2007/opf">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:description>&lt;p&gt;&lt;b&gt;Diap&nbsp;Dealer&lt;/b&gt;&lt;/p&gt;</dc:description>
    <dc:language>en</dc:language>
    <dc:title>Sample e-book</dc:title>
    <dc:date opf:event="modification" xmlns:opf="http://www.idpf.org/2007/opf">2023-08-02</dc:date>
    <dc:identifier opf:scheme="UUID" id="BookId">urn:uuid:23fcbb08-a645-4bbf-ac1d-92e37f6182bf</dc:identifier>
    <meta name="cover" content="x1.jpg"/>
  </metadata>
  <manifest>
    <item id="Section0001.xhtml" href="t/Section0001.xhtml" media-type="application/xhtml+xml"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="cover.xhtml" href="t/cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="x1.jpg" href="i/1.jpg" media-type="image/jpeg"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="cover.xhtml"/>
    <itemref idref="Section0001.xhtml"/>
  </spine>
  <guide>
    <reference type="cover" title="Cover" href="t/cover.xhtml"/>
  </guide>
</package>

So this is a bug in our BS4 or lxml or something we just never detected before.

kevinhendricks commented 1 year ago

For this routine to work, we really need to leave all xml content untouched and not try to detect nbsp or any other entities.

I think the issue is in the decode xml routine

kevinhendricks commented 1 year ago

It seems we turned off resolve_entities as some point for security and then due to use of entities in metadata attribute values we turned them back on with this change:

https://github.com/Sigil-Ebook/Sigil/commit/42866714c5b93b8650388894c0ffb4b8899e82d6

So we need to resolve entities to fix bugs in metadata attributes that use entities and then when we decode that contents of a tag we do not escape them again but I am really unsure how to fix this in decode_xml deep inside sigil_bs4 decode_xml routines.

dougmassay commented 1 year ago

Formatted descriptions were all the rage in the early days, but they never made a lot of sense to me. 1) Most reading apps don't display them, so they usually only get seen by something like calibre. 2) They get filled will self-defeating, problematic cruft that could easily be replaced with pure unicode glyphs instead of entities. I know we have to fix this, but I just don't understand why people who insist on highly formatted descriptions don't simply use unicode (if necessary) and a few simple escaped bold and/or italics tags.

But I digress... :)

dougmassay commented 1 year ago

Disabling resolve_entities seems vaguely familiar, for what it's worth.

kevinhendricks commented 1 year ago

So we need them on for xml attributes entities to work so we can not turn them off. So we just have to undo the damage when we convert the tree back to xml.

Here is a patch to our own sigil_bs4 that seems to do what we want. It uses the approach of splitting and rebuilding that we designed for KindleUnpack when we added support for KF8. So it should be pretty robust.

diff --git a/src/Resource_Files/plugin_launchers/python/sigil_bs4/element.py b/src/Resource_Files/plugin_launchers/python/sigil_bs4/element.py
index cfdf14e53..086e01c12 100644
--- a/src/Resource_Files/plugin_launchers/python/sigil_bs4/element.py
+++ b/src/Resource_Files/plugin_launchers/python/sigil_bs4/element.py
@@ -42,6 +42,9 @@ EBOOK_XML_PARENT_TAGS = ("package","metadata","manifest","spine","guide","ncx",
                          "head","doctitle","docauthor","navmap", "navpoint",
                           "navlabel", "pagelist", "pagetarget") 

+IS_ENTITY = re.compile("(&#\d+;|&#x[0-9a-fA-F]+;|&\w+;)")
+
+
 def _alias(attr):
     """Alias one attribute name to another for backward compatibility"""
     @property
@@ -1324,6 +1327,15 @@ class Tag(PageElement):
                 s.append(val)
             if text:
                 text = text.strip()
+                # walk the contents and escape html named entities from xml tag contents
+                pieces = IS_ENTITY.split(text);
+                for i in range(1, len(pieces),2):
+                    piece = pieces[i]
+                    if piece not in ["&lt;", "&gt;", "&amp;"]:
+                        # not a xml base entity
+                        piece = "&amp;" + piece[1:]
+                    pieces[i] = piece
+                text = "".join(pieces)
             if text:
                 if is_xmlparent and len(s) == 0:
                     s.append(indent_chars * (indent_level - 1))
kevinhendricks commented 1 year ago

The other way to deal with this is to convert them all to their single characters. Perhaps that would be safer? But it would make it harder for people to hand edit.

Please give what I pushed a try on your favourite epubs with lots of entities and html tags in opf metadata tags (and anyplace else in the OPF or NCX) and let me know if it appears to work without breaking anything else.

It it breaks something else I will just resolve the damn things back to their characters.

dougmassay commented 1 year ago

Will do. It might be a bit later, though. Plus I don't have a lot of epubs with entities and tags in opf metadata, but I'll see what I can find.

kevinhendricks commented 1 year ago

Based on my admittedly limited testing (I only have about 3 epubs that use tags and entities inside metadata elements in the opf). I think we have fixed this issue in master.

So I am closing this as fixed but if based on testing anyone thinks something is amiss, I would be happy to re-open it and try converting the named and numeric entities to unicode characters.

Thank you for your bug report (and Becky for your simple test case).