Closed shaunlebron closed 9 years ago
Please take a look at this.
Hey @shaunlebron,
Thanks for sending the pull request. I tried using it, but there seems to be an encoding issue: the character ’ is converted to â. Can you please take a look?
Couldn't recreate the problem. I pasted that character into the body of a ghost blog post. Buster spit out a file that preserved the character.
I'm on a Mac with Python 2.7.3. You?
I'm using Python 2.7.6 on Mac. The character was in the heading in my case but I don't think that would make a difference.
I tried the following:
In [1]: from pyquery import PyQuery
In [2]: text = '’'
In [3]: print text
’
In [4]: d = PyQuery(text, parser='html')
In [5]: d
Out[5]: [<p>]
In [6]: print d.__unicode__().encode('utf8')
<p>â</p>
In [7]: d.__unicode__().encode('utf8')
Out[7]: '<p>\xc3\xa2\xc2\x80\xc2\x99</p>'
Am I missing something?
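For reference, the bytes in Out[7] match what you get when the UTF-8 encoding of ’ is decoded as Latin-1 and then encoded as UTF-8 again. Below is a minimal sketch of that round trip, written so it also runs on Python 3. It only reproduces the byte pattern; that the misread happens this way inside PyQuery or lxml is an assumption, not something traced in their code.

```python
# -*- coding: utf-8 -*-
# Reproduce the byte pattern from Out[7] using only the standard library.

original = u"\u2019"                     # RIGHT SINGLE QUOTATION MARK (’)
utf8_bytes = original.encode("utf-8")    # three bytes: e2 80 99

# Assumed failure mode: the UTF-8 bytes are read as if they were Latin-1,
# so each byte turns into its own character (U+00E2, U+0080, U+0099).
misread = utf8_bytes.decode("latin-1")

# Encoding the misread text as UTF-8 doubles up the bytes.
mojibake = misread.encode("utf-8")

# Decoding with the correct codec gives the original character back.
repaired = utf8_bytes.decode("utf-8")
```

If this is what is happening, handing the parser a decoded unicode string instead of raw bytes should avoid the misread.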
filetext = f.read().decode('utf8') should fix this, I think. Let me check and get back.
Strange. My repl produced the same results.
Maybe we have different wget versions which produce differently encoded html files.
Your decode solution worked for me though:
>>> text = "I’m going crazy"
>>> d = PyQuery(text, parser='html')
>>> print d.__unicode__().encode('utf8')
Iâm going crazy
>>> text = "I’m going crazy".decode('utf8')
>>> d = PyQuery(text, parser='html')
>>> print d.__unicode__().encode('utf8')
I’m going crazy
added decode fix to this pull request
Thanks! I have a feeling this might break with the wget version you were using. (I'm assuming your wget returns a UTF-8 encoded file, since it didn't break the apostrophe last time. I might be wrong here.)
In [2]: text = u'’'
In [3]: d = PyQuery(text, parser='html')
In [4]: d.__unicode__().encode('utf8')
Out[4]: '<p>\xe2\x80\x99</p>'
In [5]: print d.__unicode__().encode('utf8')
<p>’</p>
In [6]: text.decode('utf8')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-6-57d99c085dcd> in <module>()
----> 1 text.decode('utf8')
Can you please check whether buster works in your case? If it doesn't, we should decode only when filetext is of type str.
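A rough sketch of such a type check, assuming UTF-8 input. The helper name to_text is hypothetical and not part of buster. On Python 2.7, bytes is an alias for str, so the same check covers the str case discussed here and also runs on Python 3.

```python
# -*- coding: utf-8 -*-

def to_text(data, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass unicode through."""
    if isinstance(data, bytes):
        # Raw bytes (Python 2 str): decode once.
        return data.decode(encoding)
    # Already unicode: calling .decode() again is what raised the
    # UnicodeEncodeError in the session above, so leave it alone.
    return data
```

With this, filetext = to_text(f.read()) would behave the same whether the file contents arrive as bytes or as already-decoded text.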
You're right. I should've checked that. It worked for some pages, but it failed with test_page.html, which I've committed to this pull request. test_page.html is a wget-generated page containing a blog post with the title "I’m testing an apostrophe" and the body "I’m writing a blog.".
Here is what my system yields.
Python 2.7.6 (v2.7.6:3a1db0d2747e, Nov 10 2013, 00:42:54)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyquery import PyQuery
>>> f = open('test_page.html')
>>> text = f.read()
>>> f.close()
>>>
>>> type(text)
<type 'str'>
>>> d = PyQuery(text, parser='html')
>>> print d.__unicode__().encode('utf8')
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title/><description/><link/>http://shaunlebron.com/<generator>Ghost v0.4.1</generator><lastbuilddate>Tue, 01 Apr 2014 13:38:59 GMT</lastbuilddate><link href="http://shaunlebron.com/rss/" rel="self" type="application/rss+xml"/><author/><ttl>60</ttl><item><title/><description>I’m writing a blog.]]></description><link/>http://shaunlebron.com/connection-through-writing/<guid ispermalink="false">9ef23b1e-fb86-43cb-882a-b2c813c8eac1</guid><creator/><pubdate>Thu, 13 Feb 2014 21:13:45 GMT</pubdate></item></channel></rss>
>>>
>>> utext = text.decode('utf8')
>>> type(utext)
<type 'unicode'>
>>>
>>> d = PyQuery(utext, parser='html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pyquery/pyquery.py", line 226, in __init__
    elements = fromstring(context, self.parser)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pyquery/pyquery.py", line 90, in fromstring
    result = custom_parser(context)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/html/__init__.py", line 704, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/html/__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3003, in lxml.etree.fromstring (src/lxml/lxml.etree.c:67277)
  File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:101591)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
>>>
I'll look at it again tonight. But see what your system does with test_page.html
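One possible way around the ValueError, sketched under assumptions: decode the file contents unless they start with an XML declaration that names an encoding, and in that case hand the raw bytes to the parser, as the error message suggests. The helper name prepare_markup and the regular expression are hypothetical, a leading byte order mark is not handled, and this is untested against buster itself.

```python
# -*- coding: utf-8 -*-
import re

# Matches a leading XML declaration that carries an encoding attribute,
# e.g. <?xml version="1.0" encoding="UTF-8"?>
_XML_ENCODING_DECL = re.compile(br'^\s*<\?xml[^>]*\bencoding\s*=', re.IGNORECASE)


def prepare_markup(raw, encoding="utf-8"):
    """Hypothetical helper: pick the input form to pass to the parser.

    raw is the byte string read from the wget-generated file.
    """
    if _XML_ENCODING_DECL.match(raw):
        # The document declares its own encoding, so keep it as bytes
        # and let the parser honour the declaration.
        return raw
    # No declaration: decode up front so the parser receives unicode.
    return raw.decode(encoding)
```

The result would then be passed on as PyQuery(prepare_markup(text), parser='html'), with text being the value read from the file as in the session above.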
Is this fix working? :)
:+1:
installed, but gives me this error when I generate my blog:
Traceback (most recent call last):
File "/usr/bin/buster", line 9, in
If this is working, why hasn't it been pulled?
+1 for getting this feature into the next buster release.
I'm using this in "production" at http://blog.lasconic.com.
I encountered some problems though. PyQuery is replacing
I made a new PR #35.
Added. Thanks @lasconic @shaunlebron
Fixes to #16
Added the PyQuery dependency. Required for safely replacing the urls on each of the generated pages.