`test_url1` works fine, but `test_url2` gives the wrong result. When I printed `pq.html()` for `test_url2`, I found that the latter part of the `script` tag just above the `div` with `class="breadnav"` was missing or truncated. This did not happen with `test_url1`.
If I fetch the page source with `wget`, a browser, or `urllib`, I get the correct HTML.
`test_url2` is also not the only failing case.
I don't think this is related to pyquery. pyquery uses lxml to parse the HTML/XML, so it's probably an lxml problem. Try to reproduce the problem with `lxml.(etree|html).parse` first.
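A minimal sketch of that check, assuming the raw bytes are fetched separately with `urllib` and then handed to `lxml.html.parse` alone, with pyquery out of the picture. The helper names `fetch_bytes` and `marker_survives`, and the approach of searching for a marker string, are illustrative assumptions and not taken from the original report.

```python
import io
import urllib.request

import lxml.html


def fetch_bytes(url):
    """Download the raw page bytes, the same source wget or urllib would show."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def marker_survives(raw_html, marker):
    """Parse raw HTML with lxml only and report whether `marker` is still
    present in the re-serialized document."""
    tree = lxml.html.parse(io.BytesIO(raw_html))
    serialized = lxml.html.tostring(tree.getroot(), encoding="unicode")
    return marker in serialized
```

To use it, pick a string that appears near the end of the affected `script` tag in the raw source and call `marker_survives(fetch_bytes(url), marker)` with the failing URL. If it returns `False` for text that is present in the raw bytes, that would suggest the content is being lost inside lxml rather than in pyquery.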
My code is just like this:
Thank you ~ :)