Open laggardkernel opened 5 years ago
It's probably caused by some underlying lxml behaviour. It's always hard to figure out. I think that lxml does not print the xml header (doctype) and removes one of the xmlns declarations because they use the same attribute name.
As I remember, using `f.write(doc.outerHtml())` should solve your second problem.
@gawel Thanks for your help. `doc.outerHtml()` does solve my problem; it ensures tags are closed correctly in the XHTML file. I don't care very much about the dropped `<!DOCTYPE>`.

To find out why `.outerHtml()` makes a difference, I did some digging in the source of `pyquery.pyquery`.
```python
def __str__(self):
    encoding = str if PY3k else None
    return ''.join([etree.tostring(e, encoding=encoding) for e in self])

...

@with_camel_case_alias
def outer_html(self, method="html"):
    """Get the html representation of the first selected element::

        >>> d = PyQuery('<div><span class="red">toto</span> rocks</div>')
        >>> print(d('span'))
        <span class="red">toto</span> rocks
        >>> print(d('span').outer_html())
        <span class="red">toto</span>
        >>> print(d('span').outerHtml())
        <span class="red">toto</span>
        >>> S = PyQuery('<p>Only <b>me</b> & myself</p>')
        >>> print(S('b').outer_html())
        <b>me</b>

    ..
    """
    if not self:
        return None
    e0 = self[0]
    if e0.tail:
        e0 = deepcopy(e0)
        e0.tail = ''
    return etree.tostring(e0, encoding=text_type, method=method)
```
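The tail-handling in `outer_html()` above matters because lxml stores the text that follows an element's closing tag on the element itself, and `tostring()` includes it. The stdlib `xml.etree.ElementTree` behaves the same way, so the effect can be sketched without pyquery (a hedged illustration, not pyquery's actual code):

```python
import copy
import xml.etree.ElementTree as ET

div = ET.fromstring('<div><span class="red">toto</span> rocks</div>')
span = div.find('span')

# tostring() includes the tail text that follows the element...
with_tail = ET.tostring(span, encoding='unicode')
# -> '<span class="red">toto</span> rocks'

# ...which is why outer_html() deep-copies the element and clears
# the tail before serializing.
clone = copy.deepcopy(span)
clone.tail = ''
without_tail = ET.tostring(clone, encoding='unicode')
# -> '<span class="red">toto</span>'
```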
It seems `etree.tostring(..., method='html')` in `.outerHtml()` is what got me valid XHTML. Then I confirmed my guess by looking into the source of `lxml`:
```python
In [6]: help(etree.tostring)
Help on cython_function_or_method in module lxml.etree:

tostring(element_or_tree, *, encoding=None, method='xml', xml_declaration=None, pretty_print=False, with_tail=True, standalone=None, doctype=None, exclusive=False, with_comments=True, inclusive_ns_prefixes=None)
...
```
As you can see, `lxml.etree.tostring()` converts objects to `xml` strings by default. Wouldn't it be better to keep the content in the form it was parsed as (in my case, I parsed the content as `html`) when converting it to strings in `__str__()`?
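The difference the `method` parameter makes can be seen even with the stdlib `xml.etree.ElementTree`, whose `tostring()` also defaults to `method='xml'` (a minimal sketch, not lxml itself):

```python
import xml.etree.ElementTree as ET

# An empty <script> element, as commonly found in X/HTML pages.
el = ET.fromstring('<script src="app.js"></script>')

# The default (method='xml') collapses the empty element...
xml_out = ET.tostring(el, encoding='unicode')
# -> '<script src="app.js" />'

# ...while method='html' keeps the explicit closing tag, which
# browsers require for <script>.
html_out = ET.tostring(el, encoding='unicode', method='html')
# -> '<script src="app.js"></script>'
```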
Yeah, the method should be preserved. As I remember it's hard, because we use `self.__class__()` a lot. That means the original method may already be lost when you try to print only an extracted part of the doc, not the whole doc.
@gawel I see. Talking about the DOCTYPE stuff, I found something that may be helpful.

The `Element` object from `lxml` doesn't preserve the DOCTYPE and DTD content. If we need to keep them, we should parse the input as an `ElementTree`; converting the `ElementTree` object to a string then yields the whole document.

From "The ElementTree class" in the `lxml` docs:

> One of the important differences is that the ElementTree class serialises as a complete document, as opposed to a single Element. This includes top-level processing instructions and comments, as well as a DOCTYPE and other DTD content in the document.
```python
In [1]: from lxml import etree

In [2]: root = etree.XML('''\
   ...: <?xml version="1.0"?>
   ...: <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
   ...: <root>
   ...: <a>&tasty;</a>
   ...: </root>
   ...: ''')

In [3]: tree = etree.ElementTree(root)

In [4]: tree.docinfo.public_id = '-//W3C//DTD XHTML 1.0 Transitional//EN'

In [5]: tree.docinfo.system_url = 'file://local.dtd'

In [6]: print(etree.tounicode(tree))
<!DOCTYPE root PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://local.dtd" [
<!ENTITY tasty "parsnips">
]>
<root>
<a>parsnips</a>
</root>

In [7]: result = tree.getroot()

In [8]: type(result)
Out[8]: lxml.etree._Element

In [9]: print(etree.tounicode(result))
<root>
<a>parsnips</a>
</root>
```
```python
In [10]: from io import StringIO

In [11]: root = etree.parse(StringIO('''\
    ...: <?xml version="1.0"?>
    ...: <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
    ...: <root>
    ...: <a>&tasty;</a>
    ...: </root>
    ...: '''))

In [12]: type(root)
Out[12]: lxml.etree._ElementTree

In [13]: print(etree.tounicode(root))
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "parsnips">
]>
<root>
<a>parsnips</a>
</root>
```
pyquery already uses etree: https://github.com/gawel/pyquery/blob/master/pyquery/pyquery.py#L96

The problem is that html is not xml, and most web pages fail to parse as xml.
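To illustrate that point: typical web markup is not well-formed XML, so a strict XML parser rejects it outright. The stdlib parser shows the same failure mode as lxml's XML parser (a minimal sketch):

```python
import xml.etree.ElementTree as ET

# Void elements like <br> have no closing tag in HTML, which is
# fine for an HTML parser but not well-formed XML.
try:
    ET.fromstring('<p>one<br>two</p>')
    parsed = True
except ET.ParseError:
    parsed = False
# parsed is False: the XML parser raises a "mismatched tag" error.
```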
I know PyQuery is using `etree` from `lxml` to parse input. I'm trying to emphasize the difference between `etree.parse()` and `etree.fromstring()`, which could explain why `<!DOCTYPE>` is lost in my case.

`etree.parse()` accepts files or file-like objects as input and returns an instance of `ElementTree`. `etree.fromstring()` accepts strings as input and returns an instance of `Element`. (`ElementTree` and `Element` are classes from `lxml`.) The `ElementTree` keeps the `<!DOCTYPE>` info, but the `Element` discards it.
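The same split exists in the stdlib `xml.etree.ElementTree`, whose API lxml mirrors (the DOCTYPE preservation shown in the sessions above is lxml-specific; this only sketches the parse/fromstring type difference):

```python
import io
import xml.etree.ElementTree as ET

doc = '<root><a>text</a></root>'

# parse() takes a file or file-like object and returns an ElementTree...
tree = ET.parse(io.StringIO(doc))
# ...while fromstring() takes a string and returns an Element.
elem = ET.fromstring(doc)

type(tree).__name__   # 'ElementTree'
type(elem).__name__   # 'Element'

# getroot() bridges the two: it unwraps the tree to its root Element.
tree.getroot().tag    # 'root'
```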
In the source of `pyquery.py`, files are parsed with `etree.parse()`, which returns an instance of `ElementTree`. Calling `.getroot()` on it returns the `<html>` `Element`; hence the DOCTYPE is lost. https://github.com/gawel/pyquery/blob/master/pyquery/pyquery.py#L123
I see. I don't know why it uses getroot; this is some pretty old code, probably here since the first commit :D
A problem occurred when I was parsing an XHTML file from the index page of the pipenv doc: https://pipenv.readthedocs.io/en/latest/

Here's part of the code I'm using:

Basically, I'm trying to remove the badges from the XHTML file with the help of the `html` parser. The original header and `<!DOCTYPE>` are displayed below:

After running the script once, the badges were removed successfully, but I noticed the `<!DOCTYPE>` was omitted by `f.write(str(doc))`:

The more confusing thing is, after running the script again (the 2nd time), when of course there were no badges left to remove, the style of the XHTML file was changed once more. The `<html>` attribute `xmlns` was omitted, and the `<script></script>` tag was changed to `<script/>`:

Obviously, the 2nd run of the script made the XHTML invalid. I couldn't figure out what was wrong. Is this caused by the parser, `parser='html'`, or by my wrong use of pyquery to modify a local XHTML file with `f.write(str(doc))`?
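For reference, the round trip described above can be sketched with the stdlib `xml.etree.ElementTree` (the real script used pyquery; the `badge` class and the markup here are hypothetical stand-ins for the actual page):

```python
import xml.etree.ElementTree as ET

page = ('<html><head><script src="x.js"></script></head>'
        '<body><a class="badge"><img src="b.svg"/></a>'
        '<p>content</p></body></html>')

root = ET.fromstring(page)
body = root.find('body')
# Drop badge links (hypothetical selector).
for child in list(body):
    if child.get('class') == 'badge':
        body.remove(child)

# Serializing with the default method='xml' collapses the now-empty
# <script> element into <script ... /> -- the breakage reported above...
xml_out = ET.tostring(root, encoding='unicode')

# ...while method='html' keeps the explicit closing tag.
html_out = ET.tostring(root, encoding='unicode', method='html')
```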