elliotgao2 / gain

Web crawling framework based on asyncio.
GNU General Public License v3.0

Unescape HTML that contains HTML entities #23

wisecsj closed 7 years ago

wisecsj commented 7 years ago

When the fetched HTML contains HTML entities, pyquery does not work correctly, and that is why this pull request came into being.

But, to my surprise, I found that you did the same thing in commit df8b4d7da5687e87334723be0834b0b1d6190530. What confuses me is that you then deleted that line in commit e3ee18a732b638a64da228ca54a8db45bdb06be2 and added url = unescape(url) instead, because the rules parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&amp;uid=\d+&amp;do=blog&amp;view=me&amp;from=space&amp;page=\d+'), Parser('blog\-\d+\-\d+\.html', Post)] contain HTML entities such as &amp;.

So I am confused about why you did that. If the whole HTML is unescaped, not only does pyquery work fine, but there is also no need to change parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&uid=\d+&do=blog&view=me&from=space&page=\d+'), to parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&amp;uid=\d+&amp;do=blog&amp;view=me&amp;from=space&amp;page=\d+'), Parser('blog\-\d+\-\d+\.html', Post)], since the former is how we are used to writing the rules.
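To illustrate the point, here is a minimal sketch using only the standard library (re and html.unescape). The raw_html fragment and the uid/page values are made up for the example, and this is not gain's actual code; it only shows that a rule written with a plain & fails on the raw page source but matches once the whole HTML has been unescaped.

```python
import re
from html import unescape

# Hypothetical fragment of a fetched page. In raw HTML source the "&" inside
# an href is normally written as the entity "&amp;".
raw_html = (
    '<a href="http://blog.sciencenet.cn/home.php?mod=space&amp;uid=1'
    '&amp;do=blog&amp;view=me&amp;from=space&amp;page=2">next</a>'
)

# The rule written the way we are used to, with a plain "&".
rule = (
    r'http://blog.sciencenet.cn/home.php\?mod=space&uid=\d+'
    r'&do=blog&view=me&from=space&page=\d+'
)

# Against the raw source the rule finds nothing, because the page has "&amp;".
matches_raw = re.findall(rule, raw_html)

# After unescaping the whole HTML, the same rule matches the link.
matches_unescaped = re.findall(rule, unescape(raw_html))

print(matches_raw)
print(matches_unescaped)
```

With the whole document unescaped up front, the extracted URL is already in its plain form, so a separate url = unescape(url) step would not be needed in this sketch.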

As an undergraduate student, maybe there are some cases I haven't taken into account, or maybe I'm wrong.

By the way, I opened an issue describing my problem. Could you help me out?

elliotgao2 commented 7 years ago

You are right. I should unescape the whole html.