aerkalov / ebooklib

Python E-book library for handling books in EPUB2/EPUB3 format -
https://ebooklib.readthedocs.io/
GNU Affero General Public License v3.0

In some HTML elements, attribute names need to be case-sensitive to take effect, for example, the viewBox attribute within the <svg> element. #295

Closed changyy closed 11 months ago

changyy commented 11 months ago

#294

As the EPUB format has developed in recent years, more and more people use SVG for image layout in XHTML. Currently, attribute names on SVG elements, such as viewBox, are converted to lowercase. Testing in Apple Books and the Chrome browser showed that <svg viewbox="0 0 960 1080"> has no effect until it is changed to <svg viewBox="0 0 960 1080">.

Upon investigation, the issue originates in the Python lxml package: after processing with html.document_fromstring, attribute names are lowercase. This is harmless for ordinary HTML attributes, whose names are case-insensitive, but it breaks SVG attributes such as viewBox, whose names are case-sensitive in the XHTML documents that EPUB uses.
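The lowercasing can be reproduced without lxml. The sketch below uses the standard library's html.parser instead, which also follows the HTML convention and reports attribute names in lowercase. It illustrates the behavior only and is not the code path ebooklib takes; the AttributeCollector class is a hypothetical helper.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Hypothetical helper: record attribute names as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; html.parser lowercases the names.
        self.seen.extend(name for name, _ in attrs)


collector = AttributeCollector()
collector.feed('<svg viewBox="0 0 960 1080"></svg>')
collector.close()

# The mixed-case spelling from the input is lost at this point.
reported_names = collector.seen
```

After parsing, reported_names holds the lowercase spelling, so any serializer working from this data writes viewbox rather than viewBox.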

Currently, the parse_html_string function in ebooklib/utils.py attempts a pass over the html_tree to restore the mixed-case spelling of the attributes that need it.
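A minimal sketch of such a pass is shown below. It is not the actual code in ebooklib/utils.py. The restore_attribute_case function and the SVG_ATTRIBUTE_CASE table are hypothetical names, the table is a two-entry excerpt, and the standard library's xml.etree.ElementTree stands in for lxml so the example has no dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of the lowercase -> mixed-case table; a full table
# would cover every mixed-case name in the SVG attribute index.
SVG_ATTRIBUTE_CASE = {
    "viewbox": "viewBox",
    "preserveaspectratio": "preserveAspectRatio",
}


def restore_attribute_case(root, mapping=SVG_ATTRIBUTE_CASE):
    """Rename lowercased attributes in place to their mixed-case spelling."""
    for element in root.iter():
        # Copy the keys first, because the attribute dict changes in the loop.
        for name in list(element.attrib):
            fixed = mapping.get(name)
            if fixed is not None and fixed != name:
                element.attrib[fixed] = element.attrib.pop(name)
    return root


# Simulate a tree whose attribute names were lowercased by the parser.
tree = ET.fromstring('<div><svg viewbox="0 0 960 1080" width="960"/></div>')
restore_attribute_case(tree)
svg = tree.find("svg")
```

Attributes that are not in the table, such as width, are left untouched, so the pass only affects names known to need mixed case.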

The list of attributes comes from: https://www.w3.org/TR/SVG/attindex.html

JavaScript code:

// Map each lowercased attribute name to its mixed-case spelling.
const targetList = {};
document.querySelectorAll("body > table > tbody > tr").forEach(function (trElement) {
    // The attribute name sits in the row's header cell.
    const target = trElement.querySelector("th > span > a > span");
    // Keep only names that contain an uppercase letter.
    if (target && target.textContent && /[A-Z]/.test(target.textContent)) {
        targetList[target.textContent.toLowerCase()] = target.textContent;
    }
});
JSON.stringify(targetList, null, 2);
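The same filter can be sketched in Python for a list of attribute names that has already been collected. The build_case_map function is a hypothetical helper and the sample names are a small hand-picked set of SVG attributes, not the full index.

```python
def build_case_map(names):
    """Map lowercased names to their original spelling, skipping all-lowercase names."""
    return {
        name.lower(): name
        for name in names
        # Same test as the /[A-Z]/ check: keep names with an uppercase letter.
        if any("A" <= ch <= "Z" for ch in name)
    }


sample_names = ["viewBox", "width", "preserveAspectRatio", "gradientUnits"]
case_map = build_case_map(sample_names)
```

Names that are already all lowercase, such as width, are skipped because they need no correction after parsing.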