gawel / pyquery

A jquery-like library for python
http://pyquery.rtfd.org/

Use chardet to auto-detect encodings? #104

Closed Jorl17 closed 8 years ago

Jorl17 commented 9 years ago

Recently I ran into some issues in the html() method when the output wasn't in the expected encoding (which I see is utf-8 by default).

My application could not, a priori, know what the encoding was going to be, and this was a nuisance.

Perhaps using the chardet library could help alleviate this issue? It might also be the case that I simply did not correctly understand how html() works. If that was the case, then I'm sorry.

If indeed you think this fix should be applied, I can chip in and add it myself. Also, thanks for the awesome work!

gawel commented 9 years ago

I'm not sure that this is very useful, since 90% of users probably use requests, which already uses chardet (or an alternative). So I don't really see the benefit. But if you need this, pull requests with tests are always welcome.

Jorl17 commented 9 years ago

Perhaps I misused the library, but here's a snippet of "offending" code:

page = PyQuery(url=...)
page_html = page.html()

Line 2 of this code blew up with encoding errors in etree (Python 3) on a Raspbian distro. This did not happen on a Mac OS X system. Unfortunately, I do not have the offending URL (and cannot find it again), but I think I fixed this by manually doing:

f = urllib.request.urlopen(url=...)
page_html = PyQuery(str(f.read()).encode('utf8')).html()

Which sounds hacky, dirty, and to be honest, since I can't test it, I'm not even sure if it's really correct. It was with this in mind that I thought of modifying PyQuery to include support for chardet. Do you think I was doing something wrong? Thanks!
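
A note on the workaround above: in Python 3, calling str() on a bytes object does not decode it; it returns the repr (the text b'...' with escapes), so the workaround probably avoids the error without recovering the real text. A minimal sketch of the difference, using a made-up Latin-1 payload (the sample bytes are an assumption, not the original page):

```python
# Hypothetical response body: "café" encoded as Latin-1 (not valid UTF-8).
raw = "café".encode("latin-1")

# str() on bytes in Python 3 does not decode; it returns the repr string.
as_repr = str(raw)

# Decoding with the correct codec recovers the actual text.
as_text = raw.decode("latin-1")

print(as_repr)  # b'caf\xe9'
print(as_text)  # café
```

So whatever is passed to PyQuery should be either the raw bytes or text decoded with the right codec, not the str() of the bytes.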

gawel commented 9 years ago

Just try with requests installed. PyQuery will use it. See https://github.com/gawel/pyquery/blob/master/pyquery/openers.py#L69

And since requests' resp.text is unicode, everything should be fine.

gawel commented 8 years ago

You can also use PyQuery(url=my_url, encoding='utf-8')

Jorl17 commented 8 years ago

PyQuery(url=my_url, encoding='utf-8') will not work because the offending encoding is not utf-8 compatible. That is the issue here. Requests was not an option on our project. We resorted to using chardet internally.

Nevertheless, I understand why this is a "non-issue" and agree that it makes sense to close it.
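
For readers in the same situation, the "use chardet internally" approach might look roughly like the sketch below. This is not the code from the project mentioned above. The helper name decode_with_detection and its parameters are hypothetical; the detector is passed in as a callable so the helper itself has no third-party dependency. chardet.detect fits that shape, since it takes bytes and returns a dict with an "encoding" key.

```python
def decode_with_detection(raw, detect=None, fallback="utf-8"):
    """Decode bytes to str using a detected encoding (hypothetical helper).

    `detect` is any callable that takes bytes and returns a dict with an
    "encoding" key, for example chardet.detect. If detection is unavailable
    or yields nothing usable, `fallback` is used instead.
    """
    encoding = None
    if detect is not None:
        encoding = detect(raw).get("encoding")
    try:
        return raw.decode(encoding or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode with the fallback
        # and replace undecodable bytes instead of raising.
        return raw.decode(fallback, errors="replace")


# Usage sketch (assumes chardet and pyquery are installed; my_url is a placeholder):
#
#   import urllib.request
#   import chardet
#   from pyquery import PyQuery
#
#   raw = urllib.request.urlopen(my_url).read()
#   page_html = PyQuery(decode_with_detection(raw, detect=chardet.detect)).html()
```

Detection is a statistical guess, so the fallback path with errors="replace" is there to keep a wrong guess from crashing the caller.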