gawel / pyquery

A jquery-like library for python
http://pyquery.rtfd.org/

Error when parsing XHTML #199

Open laggardkernel opened 5 years ago

laggardkernel commented 5 years ago

A problem occurred when I was parsing an XHTML file saved from the index page of the pipenv docs: https://pipenv.readthedocs.io/en/latest/

Here's part of the code I'm using:

from pyquery import PyQuery as pq

def remove_pics(filename):
    doc = pq(filename=filename, parser="html")
    remove_list = [
        "div.section > h1 + img",
        "div.section > h1 ~ a.image-reference",
        "body div.footer + a",
    ]
    for item in remove_list:
        temp = doc.find(item)
        if temp:
            print(temp)
            temp.remove()

    with open(filename, "w+") as f:
        f.write(str(doc))

Basically, I'm trying to remove the badge images from the XHTML file with the help of the HTML parser.

The original header and <!DOCTYPE> are displayed below:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Pipenv: Python Dev Workflow for Humans &#8212; pipenv 2018.11.14 documentation</title>
    <link rel="stylesheet" href="_static/alabaster.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <link rel="stylesheet" href="_static/custom.css" type="text/css" />
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '2018.11.14',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true,
        SOURCELINK_SUFFIX: '.txt'
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>

After running the script once, the badges were removed successfully, but I noticed that the <!DOCTYPE> was omitted by f.write(str(doc)) and that the xmlns attribute on <html> was duplicated:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Pipenv: Python Dev Workflow for Humans — pipenv 2018.11.14 documentation</title>
    <link rel="stylesheet" href="_static/alabaster.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <link rel="stylesheet" href="_static/custom.css" type="text/css" />
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '2018.11.14',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true,
        SOURCELINK_SUFFIX: '.txt'
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>

The more confusing thing is that after running the script again (the 2nd time), when of course there were no badges left to remove, the style of the XHTML file changed once more: the duplicated xmlns attribute on <html> was reduced back to one, and each <script></script> tag was collapsed into a self-closing <script/>:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Pipenv: Python Dev Workflow for Humans — pipenv 2018.11.14 documentation</title>
    <link rel="stylesheet" href="_static/alabaster.css" type="text/css"/>
    <link rel="stylesheet" href="_static/pygments.css" type="text/css"/>
    <link rel="stylesheet" href="_static/custom.css" type="text/css"/>
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '2018.11.14',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true,
        SOURCELINK_SUFFIX: '.txt'
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"/>
    <script type="text/javascript" src="_static/underscore.js"/>
    <script type="text/javascript" src="_static/doctools.js"/>

Obviously, the 2nd run of the script made the XHTML invalid. I couldn't figure out what's wrong. Is this caused by the parser choice parser='html', or by my use of pyquery to modify a local XHTML file via f.write(str(doc))?

gawel commented 5 years ago

It's probably caused by some underlying lxml behaviour; it's always hard to figure out. I think lxml does not print XML headers (the doctype), and removes one of the xmlns declarations because they use the same attribute name.

As I remember, using f.write(doc.outerHtml()) should solve your second problem.
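
A minimal sketch of that change, reusing the remove_pics() function from the report above (same selectors; only the final write differs):

from pyquery import PyQuery as pq

def remove_pics(filename):
    doc = pq(filename=filename, parser="html")
    remove_list = [
        "div.section > h1 + img",
        "div.section > h1 ~ a.image-reference",
        "body div.footer + a",
    ]
    for item in remove_list:
        doc.find(item).remove()

    with open(filename, "w") as f:
        # outer_html() serializes with method="html", so empty elements such as
        # <script> keep an explicit closing tag instead of collapsing to <script/>.
        f.write(doc.outer_html())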

laggardkernel commented 5 years ago

@gawel Thanks for your help. doc.outerHtml() does solve my problem; it ensures the tags are closed correctly in the XHTML file. I don't care much about the dropped <!DOCTYPE>.

To find out why .outerHtml() makes a difference, I did some digging into the source of pyquery.pyquery:

    def __str__(self):
        encoding = str if PY3k else None
        return ''.join([etree.tostring(e, encoding=encoding) for e in self])

...

    @with_camel_case_alias
    def outer_html(self, method="html"):
        """Get the html representation of the first selected element::

            >>> d = PyQuery('<div><span class="red">toto</span> rocks</div>')
            >>> print(d('span'))
            <span class="red">toto</span> rocks
            >>> print(d('span').outer_html())
            <span class="red">toto</span>
            >>> print(d('span').outerHtml())
            <span class="red">toto</span>

            >>> S = PyQuery('<p>Only <b>me</b> & myself</p>')
            >>> print(S('b').outer_html())
            <b>me</b>

        ..
        """

        if not self:
            return None
        e0 = self[0]
        if e0.tail:
            e0 = deepcopy(e0)
            e0.tail = ''
        return etree.tostring(e0, encoding=text_type, method=method)

It seems the etree.tostring(..., method='html') call in .outerHtml() is what gave me valid XHTML. Then I confirmed my guess by looking at lxml's API:

In [6]: help(etree.tostring)
Help on cython_function_or_method in module lxml.etree:

tostring(element_or_tree, *, encoding=None, method='xml', xml_declaration=None, pretty_print=False, with_tail=True, standalone=None, doctype=None, exclusive=False, with_comments=True, inclusive_ns_prefixes=None)
...

As you can see, lxml.etree.tostring() serializes objects as XML strings by default. Wouldn't it be better to keep the serialization consistent with how the content was parsed (in my case I parsed it as HTML) when converting the content to a string in __str__()?
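
A tiny standalone demo of the difference, using plain lxml rather than pyquery internals:

from lxml import etree

el = etree.fromstring('<script type="text/javascript" src="_static/jquery.js"></script>')

# Default method="xml" collapses the empty element:
print(etree.tostring(el, encoding=str))
# <script type="text/javascript" src="_static/jquery.js"/>

# method="html" keeps the explicit closing tag:
print(etree.tostring(el, encoding=str, method="html"))
# <script type="text/javascript" src="_static/jquery.js"></script>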

gawel commented 5 years ago

Yeah, the method should be preserved. As I remember it's hard because we use self.__class__() a lot, which means you may have lost the original method by the time you try to print only an extracted part of the doc rather than the whole doc.
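
A rough sketch of the idea (a toy class, not pyquery's actual code): the original parser would have to be forwarded through every self.__class__() call so that __str__() can pick the matching serialization method.

from lxml import etree

class ToyPyQuery(list):
    """Toy stand-in for PyQuery, only to illustrate where the method gets lost."""

    def __init__(self, elements, parser="xml"):
        super().__init__(elements)
        self.parser = parser

    def sub_selection(self, index):
        # Without parser=self.parser here, the extracted part would silently
        # fall back to the default and be serialized as XML again.
        return self.__class__([self[index]], parser=self.parser)

    def __str__(self):
        method = "html" if self.parser == "html" else "xml"
        return "".join(
            etree.tostring(e, encoding=str, method=method) for e in self
        )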

laggardkernel commented 5 years ago

@gawel I see. Talking about the DOCTYPE stuff, I found something that may be helpful.

The Element object from lxml doesn't preserve the DOCTYPE and DTD content. If we need to keep them, we should parse the input into an ElementTree; converting the ElementTree object to a string then yields the whole document.

From "The ElementTree class" in the lxml docs:

One of the important differences is that the ElementTree class serialises as a complete document, as opposed to a single Element. This includes top-level processing instructions and comments, as well as a DOCTYPE and other DTD content in the document

In [1]: from lxml import etree

In [2]: root = etree.XML('''\
   ...: <?xml version="1.0"?>
   ...: <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
   ...: <root>
   ...:   <a>&tasty;</a>
   ...: </root>
   ...: ''')

In [3]: tree = etree.ElementTree(root)

In [4]: tree.docinfo.public_id = '-//W3C//DTD XHTML 1.0 Transitional//EN'

In [5]: tree.docinfo.system_url = 'file://local.dtd'

In [8]: print(etree.tounicode(tree))
<!DOCTYPE root PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://local.dtd" [
<!ENTITY tasty "parsnips">
]>
<root>
  <a>parsnips</a>
</root>

In [10]: result=tree.getroot()

In [11]: type(result)
Out[11]: lxml.etree._Element

In [12]: print(etree.tounicode(result))
<root>
  <a>parsnips</a>
</root>

In [6]: from io import StringIO

In [7]: root = etree.parse(StringIO('''\
   ...: <?xml version="1.0"?>
   ...: <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
   ...: <root>
   ...:   <a>&tasty;</a>
   ...: </root>
   ...: '''))

In [8]: type(root)
Out[8]: lxml.etree._ElementTree

In [10]: print(etree.tounicode(root))
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "parsnips">
]>
<root>
  <a>parsnips</a>
</root>

gawel commented 5 years ago

pyquery already uses etree: https://github.com/gawel/pyquery/blob/master/pyquery/pyquery.py#L96

The problem is that HTML is not XML, and most web pages fail to parse as XML.
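
A quick standalone illustration of that point:

from lxml import etree, html

page = "<html><body><p>unclosed paragraph<br></body></html>"

try:
    etree.fromstring(page)          # strict XML parsing rejects the unclosed tags
except etree.XMLSyntaxError as exc:
    print("XML parser failed:", exc)

doc = html.fromstring(page)         # the lenient HTML parser recovers the tree
print(etree.tostring(doc, encoding=str, method="html"))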

laggardkernel commented 5 years ago

I know PyQuery uses etree from lxml to parse input. I'm trying to emphasize the difference between etree.parse() and etree.fromstring(), which could explain why the <!DOCTYPE> is lost in my case.

etree.parse() accepts files or file-like objects as input and converts the input into an ElementTree instance. etree.fromstring() accepts strings as input and converts the input into an Element instance (ElementTree and Element are classes from lxml). The ElementTree keeps the <!DOCTYPE> info, but the Element discards it.

In the source of pyquery.py, files are parsed with etree.parse(), which returns an instance of ElementTree. Calling .getroot() on it returns the <html> Element, and hence the DOCTYPE is lost. https://github.com/gawel/pyquery/blob/master/pyquery/pyquery.py#L123
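
For anyone who needs to keep the <!DOCTYPE>, a possible workaround (just a sketch built on lxml's docinfo, not a pyquery feature, and assuming the file is still well-formed XML) is to read the doctype separately and prepend it when writing the result back:

from lxml import etree

def read_doctype(filename):
    # etree.parse() returns an ElementTree, which still carries the DTD info.
    tree = etree.parse(filename)
    return tree.docinfo.doctype

# Usage sketch:
#   doctype = read_doctype(filename)
#   doc = pq(filename=filename, parser="html")
#   ... remove the badges ...
#   with open(filename, "w") as f:
#       f.write(doctype + "\n" + doc.outer_html())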

gawel commented 5 years ago

I see. Don't know why it uses getroot(). This is some pretty old code, probably here since the first commit :D