Sigil-Ebook / Sigil

Sigil is a multi-platform EPUB ebook editor
GNU General Public License v3.0
5.97k stars 578 forks source link

Make sanitycheck.py (WellFormedCheckEpub) look for both missing quotes #777

Closed dougmassay closed 1 month ago

dougmassay commented 1 month ago

Any reason why sanitycheck.py shouldn't (or can't) look for both missing opening AND closing attribute quotes like xmlsanitycheck.py does?

https://www.mobileread.com/forums/showthread.php?p=4456608#post4456608

kevinhendricks commented 1 month ago

Please test the cases where both a single double quote or a pair of double quotes is/are embedded inside a set of single quotes, and visa-versa.

Both of these cases can happen often as they can be found in img alt strings.

There are legal html attributes without quotes but xhtml requires both the = sign and a set of matched quotes (either single or double).

dougmassay commented 1 month ago

These all pass the proposed new sanitycheck.py

<p class="blah">&nbsp;</p>
  <p class='blah'>&nbsp;</p>
  <p class='bl"ah'>&nbsp;</p>
  <p class='bl"a"h'>&nbsp;</p>
  <p class="bl'ah">&nbsp;</p>
  <p class="bl'a'h">&nbsp;</p>

These all fail:

<p class=blah>&nbsp;</p>
  <p class="blah>&nbsp;</p>
  <p class=blah">&nbsp;</p>
  <p class='blah">&nbsp;</p>
  <p class="blah'>&nbsp;</p>

Oddly enough... these pass (two double-quotes and two single-quotes):

  <p class=""blah>&nbsp;</p>
  <p class=''blah>&nbsp;</p>

while these fail:

  <p class=blah"">&nbsp;</p>
  <p class=blah''>&nbsp;</p>

But I suspect those last four have always been the case.

kevinhendricks commented 1 month ago

Sounds good. Please push all the fixes to master as I am still over a week away from returning.

dougmassay commented 1 month ago

Reverted some of my first commit because of: https://www.mobileread.com/forums/showthread.php?p=4456840#post4456840

I'm going to wait a bit on more testing. I don't want to make a mistake that just moves the problem somewhere else.

kevinhendricks commented 1 month ago

FWIW, the html meta charset info and media type should be stripped out in Sigil's first xhtml file input process someplace (or it used to get stripped out) as both are misleading for xhtml in an epub, as Sigil will always convert to utf-8 and add the xml header for charset and the media type is xhtml as not text/html.

Perhaps the process that stripped things out when loading a file got lost over time. Having both can be misleading. Does Mend remove them?

dougmassay commented 1 month ago

From what I can see, Mend ignores theses two things. They are not stripped out when first opened either.

kevinhendricks commented 1 month ago

I must have lost the code in CleanSource that used to do that somehow. I will investigate this when I return, if you do not beat me to it.

kevinhendricks commented 1 month ago

I will look into if and how html meta charset info and media type should be stripped out.

dougmassay commented 1 month ago

Sounds good. As it stands, this pull request will allow the meta charset. I'm not sure how important it is for sanity.py to allow this or fail it. Especially if we return to stripping it out on import/mend. I don't know about you, but I'd rather the Epub Well-formed Check not get overly ambitious in what it looks for. It was never intended as full-blown validator.

kevinhendricks commented 1 month ago

Yes that sounds good. The stripping out of html charset if it is not utf-8 and bad html media type info "text/html" really is something for CleanSource / Mend anyway.

So please merge this pr and I will separately look to see how/if CleanSource should strip out obsolete charset and media type meta info.

kevinhendricks commented 1 month ago

FWIW, CleanSource::RemoveMetaCharset still exists in Sigil CleanSource.cpp.

It was last invoked in Sigil 0.8.x series by the XHTMLTidy clean routine. It got lost with our move to gumbo.

I will add it back to the Mend and Mend and Prettify routine which is the current equivalent.

dougmassay commented 1 month ago

I'll ping @BeckyDTP to see if she wants to add her piece that catches no quotes around an attribute. Otherwise, I'll add it.

BeckyDTP commented 1 month ago

I see the discussion. I have to go shopping now, so feel free to add those few lines (as long as they don't cause problems).

dougmassay commented 1 month ago

Those lines have been all the testing everyone has been doing, so I think they're fine. I'll add them. Thanks for the contribution!