Closed dougmassay closed 1 month ago
Please test the cases where both a single double quote or a pair of double quotes is/are embedded inside a set of single quotes, and visa-versa.
Both of these cases can happen often as they can be found in img alt strings.
There are legal html attributes without quotes but xhtml requires both the = sign and a set of matched quotes (either single or double).
These all pass the proposed new sanitycheck.py
<p class="blah"> </p>
<p class='blah'> </p>
<p class='bl"ah'> </p>
<p class='bl"a"h'> </p>
<p class="bl'ah"> </p>
<p class="bl'a'h"> </p>
These all fail:
<p class=blah> </p>
<p class="blah> </p>
<p class=blah"> </p>
<p class='blah"> </p>
<p class="blah'> </p>
Oddly enough... these pass (two double-quotes and two single-quotes):
<p class=""blah> </p>
<p class=''blah> </p>
while these fail:
<p class=blah""> </p>
<p class=blah''> </p>
But I suspect those last four have always been the case.
Sounds good. Please push all the fixes to master as I am still over a week away from returning.
Reverted some of my first commit because of: https://www.mobileread.com/forums/showthread.php?p=4456840#post4456840
I'm going to wait a bit on more testing. I don't want to make a mistake that just moves the problem somewhere else.
FWIW, the html meta charset info and media type should be stripped out in Sigil's first xhtml file input process someplace (or it used to get stripped out) as both are misleading for xhtml in an epub, as Sigil will always convert to utf-8 and add the xml header for charset and the media type is xhtml as not text/html.
Perhaps the process that stripped things out when loading a file got lost over time. Having both can be misleading. Does Mend remove them?
From what I can see, Mend ignores theses two things. They are not stripped out when first opened either.
I must have lost the code in CleanSource that used to do that somehow. I will investigate this when I return, if you do not beat me to it.
I will look into if and how html meta charset info and media type should be stripped out.
Sounds good. As it stands, this pull request will allow the meta charset. I'm not sure how important it is for sanity.py to allow this or fail it. Especially if we return to stripping it out on import/mend. I don't know about you, but I'd rather the Epub Well-formed Check not get overly ambitious in what it looks for. It was never intended as full-blown validator.
Yes that sounds good. The stripping out of html charset if it is not utf-8 and bad html media type info "text/html" really is something for CleanSource / Mend anyway.
So please merge this pr and I will separately look to see how/if CleanSource should strip out obsolete charset and media type meta info.
FWIW, CleanSource::RemoveMetaCharset still exists in Sigil CleanSource.cpp.
It was last invoked in Sigil 0.8.x series by the XHTMLTidy clean routine. It got lost with our move to gumbo.
I will add it back to the Mend and Mend and Prettify routine which is the current equivalent.
I'll ping @BeckyDTP to see if she wants to add her piece that catches no quotes around an attribute. Otherwise, I'll add it.
I see the discussion. I have to go shopping now, so feel free to add those few lines (as long as they don't cause problems).
Those lines have been all the testing everyone has been doing, so I think they're fine. I'll add them. Thanks for the contribution!
Any reason why sanitycheck.py shouldn't (or can't) look for both missing opening AND closing attribute quotes like xmlsanitycheck.py does?
https://www.mobileread.com/forums/showthread.php?p=4456608#post4456608