Safely removing 'index.html' from relative urls

shaunlebron commented 10 years ago

Fixes #16.

Added the PyQuery dependency. Required for safely replacing the urls on each of the generated pages.
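For readers who want the idea in isolation, here is a rough sketch of the URL rewrite itself. It is not the code in this pull request, which uses PyQuery to walk each generated page. The helper name strip_index_html is hypothetical, and both the Python 3 standard library and the choice to leave absolute URLs untouched are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_index_html(href):
    """Drop a trailing 'index.html' from a relative URL (hypothetical helper)."""
    parts = urlsplit(href)
    # Assumption: anything with a scheme or host is external and left alone.
    if parts.scheme or parts.netloc:
        return href
    path = parts.path
    # Only touch a final path segment that is exactly 'index.html'.
    if path == 'index.html' or path.endswith('/index.html'):
        path = path[:-len('index.html')]
    stripped = urlunsplit(parts._replace(path=path))
    # A bare 'index.html' would otherwise collapse to an empty href.
    return stripped or './'


for link in ['posts/foo/index.html', '../index.html', 'http://example.com/index.html']:
    print(link, '->', strip_index_html(link))
```

Applying a function like this only to href attributes found by an HTML parser, rather than with a blanket text replace, is presumably what "safely" refers to in the title.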

martgnz commented 10 years ago

Please take a look at this.

axitkhurana commented 10 years ago

Hey @shaunlebron,

Thanks for sending the pull request. I tried using it, but there seems to be an encoding issue: the character ’ is converted to â. Can you please take a look?

shaunlebron commented 10 years ago

Couldn't recreate the problem. I pasted that character into the body of a ghost blog post. Buster spit out a file that preserved the character.

I'm on a Mac with Python 2.7.3. You?

axitkhurana commented 10 years ago

I'm using Python 2.7.6 on Mac. The character was in the heading in my case but I don't think that would make a difference.

axitkhurana commented 10 years ago

I tried the following:

In [1]: from pyquery import PyQuery

In [2]: text = '’'

In [3]: print text
’

In [4]: d = PyQuery(text, parser='html')

In [5]: d
Out[5]: [<p>]

In [6]: print d.__unicode__().encode('utf8')
<p>â</p>

In [7]: d.__unicode__().encode('utf8')
Out[7]: '<p>\xc3\xa2\xc2\x80\xc2\x99</p>'

Am I missing something?
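One plausible reading of Out[7], assuming the parser treated the UTF-8 byte string as Latin-1: the three bytes of ’ each became a separate character and were then encoded to UTF-8 again. The sketch below reproduces the same byte sequence with plain string operations (no PyQuery involved). It illustrates the suspected mechanism and does not prove what lxml does internally.

```python
# -*- coding: utf-8 -*-
original = u'\u2019'                      # RIGHT SINGLE QUOTATION MARK
utf8_bytes = original.encode('utf8')      # the three bytes E2 80 99

# Assumption: the bytes are misread as Latin-1, one character per byte.
misread = utf8_bytes.decode('latin-1')

# Re-encoding the misread text yields the doubled sequence seen in Out[7].
double_encoded = misread.encode('utf8')
print(repr(double_encoded))

# Decoding the bytes as UTF-8 first avoids the problem.
assert utf8_bytes.decode('utf8') == original
```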

axitkhurana commented 10 years ago

filetext = f.read().decode('utf8') should fix this I think. Let me check and get back.

shaunlebron commented 10 years ago

Strange. My repl produced the same results.

Maybe we have different wget versions that produce differently encoded HTML files.

Your decode solution worked for me though:

>>> text = "I’m going crazy"
>>> d = PyQuery(text, parser='html')
>>> print d.__unicode__().encode('utf8')
Iâm going crazy

>>> text = "I’m going crazy".decode('utf8')
>>> d = PyQuery(text, parser='html')
>>> print d.__unicode__().encode('utf8')
I’m going crazy

shaunlebron commented 10 years ago

Added the decode fix to this pull request.

axitkhurana commented 10 years ago

Thanks! I have a feeling this might break with the wget version you were using. (I'm assuming your wget returns a UTF-8 encoded file, as it didn't break the apostrophe last time. I might be wrong here.)

In [2]: text = u'’'

In [3]: d = PyQuery(text, parser='html')

In [4]: d.__unicode__().encode('utf8')
Out[4]: '<p>\xe2\x80\x99</p>'

In [5]: print d.__unicode__().encode('utf8')
<p>’</p>

In [6]: text.decode('utf8')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-6-57d99c085dcd> in <module>()
----> 1 text.decode('utf8')

Can you please check whether buster works in your case? If it doesn't, we should add a check that decodes only if filetext is of type str.
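The type check suggested above could look roughly like the following. The helper name to_unicode is hypothetical, and this is a sketch of the idea, not the code that was committed. It tests against bytes, which is an alias for str on Python 2.6+ and the separate byte-string type on Python 3, so the same function runs on both.

```python
def to_unicode(filetext, encoding='utf8'):
    """Decode byte strings; pass already-decoded text through unchanged."""
    if isinstance(filetext, bytes):
        return filetext.decode(encoding)
    # Already unicode: calling .decode() here is what raised the
    # UnicodeEncodeError in the session above.
    return filetext


print(repr(to_unicode(b'\xe2\x80\x99')))
print(repr(to_unicode(u'\u2019')))
```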

shaunlebron commented 10 years ago

You're right. I should've checked that. It worked for some pages, but it failed with test_page.html, which I've committed to this pull request.

test_page.html is a wget-generated page, containing a blog post with title "I’m testing an apostrophe" and body "I’m writing a blog.".

Here is what my system yields.

Python 2.7.6 (v2.7.6:3a1db0d2747e, Nov 10 2013, 00:42:54) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyquery import PyQuery
>>> f = open('test_page.html')
>>> text = f.read()
>>> f.close()
>>>
>>> type(text)
<type 'str'>
>>> d = PyQuery(text, parser='html')
>>> print d.__unicode__().encode('utf8')
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title/><description/><link/>http://shaunlebron.com/<generator>Ghost v0.4.1</generator><lastbuilddate>Tue, 01 Apr 2014 13:38:59 GMT</lastbuilddate><link href="http://shaunlebron.com/rss/" rel="self" type="application/rss+xml"/><author/><ttl>60</ttl><item><title/><description>I’m writing a blog.]]&gt;</description><link/>http://shaunlebron.com/connection-through-writing/<guid ispermalink="false">9ef23b1e-fb86-43cb-882a-b2c813c8eac1</guid><creator/><pubdate>Thu, 13 Feb 2014 21:13:45 GMT</pubdate></item></channel></rss>
>>> 
>>> utext = text.decode('utf8')
>>> type(utext)
<type 'unicode'>
>>>
>>> d = PyQuery(utext, parser='html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pyquery/pyquery.py", line 226, in __init__
    elements = fromstring(context, self.parser)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pyquery/pyquery.py", line 90, in fromstring
    result = custom_parser(context)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/html/__init__.py", line 704, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml/html/__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3003, in lxml.etree.fromstring (src/lxml/lxml.etree.c:67277)
  File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:101591)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
>>>

I'll look at it again tonight. In the meantime, see what your system does with test_page.html.
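The ValueError message names two ways out: pass bytes, or pass text without an encoding declaration. A minimal standard-library sketch of the second option follows. strip_xml_declaration is a hypothetical helper, not something this pull request adopted, and whether PyQuery then parses the RSS page as intended is not verified here.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?>, plus surrounding whitespace.
_XML_DECLARATION = re.compile(r'^\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Remove a leading XML declaration from already-decoded text."""
    return _XML_DECLARATION.sub('', text, count=1)


sample = u'<?xml version="1.0" encoding="UTF-8"?>\n<rss version="2.0"></rss>'
print(strip_xml_declaration(sample))
```

The other option is to skip decoding for such files and hand the raw bytes to the parser, at the cost of the apostrophe problem discussed earlier in the thread.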

martgnz commented 10 years ago

Is this fix working? :)

ryan-endacott commented 10 years ago

:+1:

martgnz commented 10 years ago

I installed it, but it gives me this error when I generate my blog:

Traceback (most recent call last):
  File "/usr/bin/buster", line 9, in <module>
    load_entry_point('buster==0.1.2', 'console_scripts', 'buster')()
  File "build/bdist.linux-x86_64/egg/buster/buster.py", line 79, in main
  File "build/bdist.linux-x86_64/egg/buster/buster.py", line 66, in fixLinks
TypeError: expected string or buffer

fluke commented 10 years ago

If this is working why hasn't it been pulled?

vitorbrandao commented 10 years ago

+1 for getting this feature into the next buster release.

lasconic commented 10 years ago

I'm using this in "production" at http://blog.lasconic.com. I encountered some problems though. PyQuery is replacing by which can cause some issues. Also, RSS feeds are not working currently: GitHub doesn't serve them as RSS since the extension is .html.

I made a new PR #35.

axitkhurana commented 9 years ago

Added. Thanks @lasconic @shaunlebron

axitkhurana / buster

Safely removing 'index.html' from relative urls #20