Closed ketalar closed 1 year ago
I confirm (on built 2.0.0 too).
Everything is shown in the video from @ketalar The error can be triggered faster.
I'm not quite following. Html tags are not allowed in the metadata description. The opf file is xml. The only way to safely include html tags in the description metadata is to use Sigil's metadata editor (which will safely xml escape any tags that are entered).
If any unescaped tags already incorrectly preexist in an epub, they will certainly be corrupted when Sigil parses the opf in question.
In the OPF file, the HTML tags are written correctly.
<dc:description><p>Divorcé depuis peu etc.
Metadata Editor allows you to view and edit them.
But this case is different. Correctly saved tags are destroyed under special conditions. It's as if an incomplete (or double-crown) conversion is taking place.
It seems to be related to this particular file/description, because I can't prepare the minimum file to demonstrate the issue.
I see now that the description's html is indeed already xml escaped in the original. Something strange happening for sure.
Perhaps something to do with the entire metadata section of the opf being one massive line with no line-breaks?
I already know.
The
entity in the description in the metadata after renaming the image file is written as plain, while it should be written as &nbsp;
Saving the file cuts out the remaining "internal" entities.
Put
in Preserve Entities.
Nevermind. It seemed to work under certain conditions, but I can't repeat it.
It's not the escaped
entity either. I've replace all of the escaped non-breaking space entities with normal spaces and the issue persists.
Could be related to the relatively new OPFParser code that came in a few releases back. Has anyone tested on Sigil 1.3 or something older to see if I introduced this bug when I ported the OPF Parsing code from Python to C++.
I will try to help track this one down when I get to a computer in a few hours.
Thanks for the test case. Each time a file is renamed, the entire opf is sanity checked, then parsed, then rewritten. The sanity check can and will introduce changes if it thinks it needs them based on lxml xml parsing.
I will try to track this one down.
On my machine:
Original description metadata in opf:
<dc:description><p><b>Diap&nbsp;Dealer</b></p></dc:description>
After Step 3:
<dc:description><p><b>Diap Dealer</b></p></dc:description>
(notice the nbsp has not been properly escaped after just this first step)
After Step 4:
<dc:description><p><b>DiapDealer/b/p</dc:description>
Now the question is where is the unescaped nbsp coming from.
It seems all properly escaped tags/entities after the first offending nbsp is ultimately borked.
My guess is that it's not specifically the nbsp entity, but rather that any properly escaped NAMED entity will trigger the issue.
Agreed.
I have tracked through the OPFParser code and it successfully leaves the alone. The problem actually occurs in the UniversalUpdates routine that happens every time a file is renamed.
This ends up calling out to the embedded Python xmlprocessor.py code:
QString PerformOPFUpdates::operator()() { QString newsource = m_source;
// serialize the hash for passing to python
QStringList dictkeys = m_XMLUpdates.keys();
QStringList dictvals;
foreach(QString key, dictkeys) {
dictvals.append(m_XMLUpdates.value(key));
}
int rv = 0;
QString error_traceback;
QList<QVariant> args;
args.append(QVariant(m_source));
args.append(QVariant(m_newbookpath));
args.append(QVariant(m_CurrentPath));
args.append(QVariant(dictkeys));
args.append(QVariant(dictvals));
EmbeddedPython * epython = EmbeddedPython::instance();
QVariant res = epython->runInPython( QString("xmlprocessor"),
QString("performOPFSourceUpdates"),
args,
&rv,
error_traceback);
So somehow this code reads in the OPF and messes things up by returning and not
This is the code that uses sanity check and the lxml processor.
I have started looking there now as everything before that left that dc:description contents untouched.
Kevin
On Aug 2, 2023, at 11:29 AM, Doug Massay @.***> wrote:
My guess is that it's not specifically the nbsp entity, but rather that any properly escaped NAMED entity will trigger the issue. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>
Yes, this is the problem routine:
I printed the OPF source upon entry and the new OPF source upon exit and sure enough, the &nbsp;
became
performOPFSourceUpdates input:
<?xml version="1.0" encoding="utf-8"?>
<package version="2.0" unique-identifier="BookId" xmlns="http://www.idpf.org/2007/opf">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:description><p><b>Diap&nbsp;Dealer</b></p></dc:description>
<dc:language>en</dc:language>
<dc:title>Sample e-book</dc:title>
<dc:date opf:event="modification" xmlns:opf="http://www.idpf.org/2007/opf">2023-08-02</dc:date>
<dc:identifier opf:scheme="UUID" id="BookId">urn:uuid:23fcbb08-a645-4bbf-ac1d-92e37f6182bf</dc:identifier>
<meta name="cover" content="x1.jpg" />
</metadata>
<manifest>
<item id="Section0001.xhtml" href="t/Section0001.xhtml" media-type="application/xhtml+xml"/>
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
<item id="cover.xhtml" href="t/cover.xhtml" media-type="application/xhtml+xml"/>
<item id="x1.jpg" href="i/1.jpg" media-type="image/jpeg"/>
</manifest>
<spine toc="ncx">
<itemref idref="cover.xhtml"/>
<itemref idref="Section0001.xhtml"/>
</spine>
<guide>
<reference type="cover" title="Cover" href="t/cover.xhtml"/>
</guide>
</package>
performOPFSourceUpdates output:
<?xml version="1.0" encoding="utf-8" ?>
<package version="2.0" unique-identifier="BookId" xmlns="http://www.idpf.org/2007/opf">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:description><p><b>Diap Dealer</b></p></dc:description>
<dc:language>en</dc:language>
<dc:title>Sample e-book</dc:title>
<dc:date opf:event="modification" xmlns:opf="http://www.idpf.org/2007/opf">2023-08-02</dc:date>
<dc:identifier opf:scheme="UUID" id="BookId">urn:uuid:23fcbb08-a645-4bbf-ac1d-92e37f6182bf</dc:identifier>
<meta name="cover" content="x1.jpg"/>
</metadata>
<manifest>
<item id="Section0001.xhtml" href="t/Section0001.xhtml" media-type="application/xhtml+xml"/>
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
<item id="cover.xhtml" href="t/cover.xhtml" media-type="application/xhtml+xml"/>
<item id="x1.jpg" href="i/1.jpg" media-type="image/jpeg"/>
</manifest>
<spine toc="ncx">
<itemref idref="cover.xhtml"/>
<itemref idref="Section0001.xhtml"/>
</spine>
<guide>
<reference type="cover" title="Cover" href="t/cover.xhtml"/>
</guide>
</package>
So this is a bug in our BS4 or lxml or something we just never detected before.
For this routine to work, we really need to leave all xml content untouched and not try to detect nbsp or any other entities.
I think the issue is in the decode xml routine
It seems we turned off resolve_entities as some point for security and then due to use of entities in metadata attribute values we turned them back on with this change:
https://github.com/Sigil-Ebook/Sigil/commit/42866714c5b93b8650388894c0ffb4b8899e82d6
So we need to resolve entities to fix bugs in metadata attributes that use entities and then when we decode that contents of a tag we do not escape them again but I am really unsure how to fix this in decode_xml deep inside sigil_bs4 decode_xml routines.
Formatted descriptions were all the rage in the early days, but they never made a lot of sense to me. 1) Most reading apps don't display them, so they usually only get seen by something like calibre. 2) They get filled will self-defeating, problematic cruft that could easily be replaced with pure unicode glyphs instead of entities. I know we have to fix this, but I just don't understand why people who insist on highly formatted descriptions don't simply use unicode (if necessary) and a few simple escaped bold and/or italics tags.
But I digress... :)
Disabling resolve_entities seems vaguely familiar, for what it's worth.
So we need them on for xml attributes entities to work so we can not turn them off. So we just have to undo the damage when we convert the tree back to xml.
Here is a patch to our own sigil_bs4 that seems to do what we want. It uses the approach of splitting and rebuilding that we designed for KindleUnpack when we added support for KF8. So it should be pretty robust.
diff --git a/src/Resource_Files/plugin_launchers/python/sigil_bs4/element.py b/src/Resource_Files/plugin_launchers/python/sigil_bs4/element.py
index cfdf14e53..086e01c12 100644
--- a/src/Resource_Files/plugin_launchers/python/sigil_bs4/element.py
+++ b/src/Resource_Files/plugin_launchers/python/sigil_bs4/element.py
@@ -42,6 +42,9 @@ EBOOK_XML_PARENT_TAGS = ("package","metadata","manifest","spine","guide","ncx",
"head","doctitle","docauthor","navmap", "navpoint",
"navlabel", "pagelist", "pagetarget")
+IS_ENTITY = re.compile("(&#\d+;|&#x[0-9a-fA-F]+;|&\w+;)")
+
+
def _alias(attr):
"""Alias one attribute name to another for backward compatibility"""
@property
@@ -1324,6 +1327,15 @@ class Tag(PageElement):
s.append(val)
if text:
text = text.strip()
+ # walk the contents and escape html named entities from xml tag contents
+ pieces = IS_ENTITY.split(text);
+ for i in range(1, len(pieces),2):
+ piece = pieces[i]
+ if piece not in ["<", ">", "&"]:
+ # not a xml base entity
+ piece = "&" + piece[1:]
+ pieces[i] = piece
+ text = "".join(pieces)
if text:
if is_xmlparent and len(s) == 0:
s.append(indent_chars * (indent_level - 1))
The other way to deal with this is to convert them all to their single characters. Perhaps that would be safer? But it would make it harder for people to hand edit.
Please give what I pushed a try on your favourite epubs with lots of entities and html tags in opf metadata tags (and anyplace else in the OPF or NCX) and let me know if it appears to work without breaking anything else.
It it breaks something else I will just resolve the damn things back to their characters.
Will do. It might be a bit later, though. Plus I don't have a lot of epubs with entities and tags in opf metadata, but I'll see what I can find.
Based on my admittedly limited testing (I only have about 3 epubs that use tags and entities inside metadata elements in the opf). I think we have fixed this issue in master.
So I am closing this as fixed but if based on testing anyone thinks something is amiss, I would be happy to re-open it and try converting the named and numeric entities to unicode characters.
Thank you for your bug report (and Becky for your simple test case).
Bug Description
Sorry I'm not a developer and I don't speak English. My text is from an automatic translation. I hope the message reaches you, because I'm having trouble understanding this Github page. When I open Sigil to modify the cover or a typo in the book, the Description metadata which is formatted with HTML tags, once the above-mentioned modifications have been made, is degraded. Example the following metadata as input:
Summary
become after recording:
bSummary/b/p Hope you are helpful. Cordially. ORIGINAL Désolé je ne suis pas développeur et je ne parle pas anglais. Mon texte est issu d'une traduction automatique. J'espère que le message vous parviendra, car j'au du mal à comprendre cette page Github. Lorsque j'ouvre Sigil pour modifier la couverture ou une faute de frappe dans le livre, les métadonnées de Description qui sont formatées avec des balises HTML, une fois les modifications sus-citées faites sont dégradées. Exemple les métadonnées suivantes en entrée :
Résumé
deviennent après enregistrement :bRésumé/b/p En espérant vous êtes utile. Cordialement. [Sigil_erreur_tags_html.zip](https://github.com/Sigil-Ebook/Sigil/files/12238636/Sigil_erreur_tags_html.zip) ### Platform (OS) Windows (Default) ### OS Version / Specifics 64 bits Professionnel ### What version of Sigil are you using? 1.9.30 ### Any backtraces or crash reports ```shell None/Aucun ```