gawel / pyquery

A jquery-like library for python
http://pyquery.rtfd.org/

PyQuery(url=_url) got a wrong html #60

Closed Lhfcws closed 10 years ago

Lhfcws commented 10 years ago

pyquery 1.2.8 (installed from pip), Python 2.7.5

My code is just like this:

from pyquery import PyQuery

pq = PyQuery(url=test_url)
print pq(".breadnav")

where test_url is one of:

test_url1 = http://www.autohome.com.cn/265/
test_url2 = http://www.autohome.com.cn/2778/

test_url1 works fine, but the result for test_url2 is wrong. When I printed pq.html() for test_url2, I found that the latter part of the script tag just above the div with class="breadnav" was missing or truncated. This did not happen with test_url1.
If I fetch the page source with wget, a browser, or urllib, I get the correct HTML.
test_url2 is not the only failing case.

Thank you :)
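
The observation above, that the raw page fetched with urllib is correct, can be checked without any parser by searching the downloaded bytes for the marker. This is a minimal sketch in Python 3 (the report itself used Python 2.7.5); the helper names `fetch`, `raw_contains`, and `check_urls` and the default marker string are illustrative assumptions, not part of pyquery.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and return its body as raw, unparsed bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def raw_contains(html_bytes, marker=b'class="breadnav"'):
    """Return True if the marker appears anywhere in the raw page bytes."""
    return marker in html_bytes


def check_urls(urls):
    """Print, for each URL, whether the raw download contains the marker."""
    for url in urls:
        print(url, raw_contains(fetch(url)))
```

Calling `check_urls` with the two URLs from the report would show whether the download itself is intact. If it is, the truncation most likely happens at the parsing step rather than during the fetch.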

gawel commented 10 years ago

I don't think it's related to pyquery. lxml is used to parse the HTML/XML, so it's probably an lxml problem. Try to reproduce the problem with lxml.(etree|html).parse first.
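
A reproduction along those lines might look like the sketch below, which feeds the same bytes to `lxml.html.parse` and to `lxml.etree.parse` with an `HTMLParser`, then counts elements carrying a given class. It is written for Python 3 and assumes the page bytes have already been downloaded separately, so that fetching and parsing can be tested in isolation. The helper names `parse_both` and `count_by_class` are hypothetical.

```python
import io

import lxml.etree
import lxml.html


def parse_both(html_bytes):
    """Parse the same bytes with lxml.html and with lxml.etree's HTML parser."""
    html_root = lxml.html.parse(io.BytesIO(html_bytes)).getroot()
    etree_root = lxml.etree.parse(
        io.BytesIO(html_bytes), lxml.etree.HTMLParser()
    ).getroot()
    return html_root, etree_root


def count_by_class(root, css_class):
    """Count elements whose class attribute contains css_class as a whole token."""
    expr = (
        "//*[contains(concat(' ', normalize-space(@class), ' '), "
        "concat(' ', $cls, ' '))]"
    )
    # lxml passes keyword arguments to xpath() as XPath variables ($cls here).
    return len(root.xpath(expr, cls=css_class))
```

If `count_by_class(root, "breadnav")` returns 0 for both roots on the failing page while the raw bytes do contain the div, the problem can be reproduced without pyquery and reported against lxml instead.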