aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

Adding user agent for input url #46

Open leesei opened 12 years ago

leesei commented 12 years ago

I'm new to Python and glad to have found this module for parsing webpages. I would like to suggest adding support for spoofing the user agent for HTTP sources. Some webpages return 401 when fetched with urlopen(), e.g. http://www.google.com/patents/US5255452. Currently I'm using another Python (2.7) script that spoofs the user agent and dumps the output for html2text:

    import urllib2
    request = urllib2.Request(url="http://www.google.com/patents/US5255452")
    # spoof user agent
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)")
    result = urllib2.urlopen(request)
    # write result.read() to file

aaronsw commented 12 years ago

You should be able to use install_opener to do this.
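
For reference, a minimal sketch of the install_opener approach. It is written against Python 3's urllib.request, the successor to urllib2; under Python 2.7 the same calls live in urllib2. The helper name `install_user_agent` is made up for illustration, and the user-agent string is just the one from the first comment. Note that this only affects code that goes through urllib.request.urlopen (urllib2.urlopen on Python 2).

```python
import urllib.request

# Example user-agent string (taken from the first comment; any value works).
USER_AGENT = "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)"


def install_user_agent(user_agent=USER_AGENT):
    """Hypothetical helper: install a global opener that sends `user_agent`."""
    opener = urllib.request.build_opener()
    # Overwrite the default ("User-agent", "Python-urllib/x.y") entry.
    opener.addheaders = [("User-Agent", user_agent)]
    # After this call, plain urllib.request.urlopen() uses this opener.
    urllib.request.install_opener(opener)
    return opener


opener = install_user_agent()
```

Any later call to urllib.request.urlopen(url) in the same process should then send the spoofed header.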

eadmaster commented 11 years ago

html2text currently uses urllib, not urllib2, so install_opener has no effect. My quick solution:

    import urllib
    urllib.URLopener.version = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16'

    import html2text
    html2text.main()

leesei commented 11 years ago

Thanks both.

Actually, I'm wondering why html2text uses urllib instead of urllib2. Is it for backward compatibility? (I'm not sure when urllib2 was added to Python.) I changed my copy to use urllib2 so that I can specify a timeout value for the connection.

@aaronsw If I added a user-agent (ua) option, would you be willing to merge it into the master branch ^^?
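
A rough sketch of what such an option might look like, assuming Python 3's urllib.request (urllib2 on Python 2.7). Everything here is hypothetical: the `--user-agent` and `--timeout` flag names, the helper functions, and the use of argparse are illustrative choices, not html2text's actual command-line interface.

```python
import argparse
import urllib.request


def parse_args(argv):
    """Parse a hypothetical command line: [--user-agent UA] [--timeout SECS] URL."""
    parser = argparse.ArgumentParser(description="Fetch a page with a custom user agent.")
    parser.add_argument("url", help="HTTP(S) URL to fetch")
    parser.add_argument("--user-agent", dest="user_agent", default=None,
                        help="User-Agent header to send (default: library default)")
    parser.add_argument("--timeout", type=float, default=10.0,
                        help="connection timeout in seconds")
    return parser.parse_args(argv)


def build_request(url, user_agent=None):
    """Build a Request, adding the User-Agent header only when one is given."""
    request = urllib.request.Request(url)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    return request


def fetch(url, user_agent=None, timeout=10.0):
    """Return the raw response body; the timeout is passed through to urlopen()."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()
```

With something like this, the fetched bytes could be decoded and handed to html2text as before, e.g. `fetch(opts.url, opts.user_agent, opts.timeout)` after `opts = parse_args(sys.argv[1:])`.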