Adding user agent for input url

leesei commented 12 years ago

I'm new to Python and glad to have found this module for parsing webpages. I would like to suggest adding support for spoofing the user agent for HTTP sources. Some webpages return 401 when fetched with urlopen(), e.g. http://www.google.com/patents/US5255452. Currently I use a separate Python (2.7) script that spoofs the user agent and dumps the output for html2text:

    import urllib2
    request = urllib2.Request(url="http://www.google.com/patents/US5255452")
    # spoof user agent
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)")
    result = urllib2.urlopen(request)
    # write result.read() to a file

aaronsw commented 12 years ago

You should be able to use install_opener to do this.
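
A minimal sketch of the install_opener approach, written for Python 3, where urllib2's functions live in urllib.request (in Python 2.7 the same calls are available on urllib2). The helper name install_user_agent and the constant USER_AGENT are made up for illustration. It only affects code that fetches pages through urlopen() of that module.

```python
import urllib.request

# Illustrative UA string, copied from the script in the first comment.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko)"


def install_user_agent(user_agent):
    """Install a global opener whose requests carry `user_agent` (hypothetical helper)."""
    opener = urllib.request.build_opener()
    # Overwrite the default ("User-agent", "Python-urllib/x.y") header list.
    opener.addheaders = [("User-Agent", user_agent)]
    # From now on, urllib.request.urlopen() goes through this opener.
    urllib.request.install_opener(opener)
    return opener


opener = install_user_agent(USER_AGENT)
```

After this call, a plain urllib.request.urlopen(url) in the same process sends the spoofed header without building a Request by hand.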

eadmaster commented 11 years ago

html2text currently uses urllib, so install_opener has no effect. My quick solution:

    import urllib
    urllib.URLopener.version = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16'

    import html2text
    html2text.main()

leesei commented 11 years ago

Thanks both.

Actually, I'm wondering why html2text uses urllib instead of urllib2. Is it for backward compatibility? (I'm not sure when urllib2 was added to Python.) I changed my copy to use urllib2 so that I can specify a timeout for the connection.
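
Roughly what that change could look like, sketched for Python 3 (urllib.request is the successor of urllib2, whose urlopen() takes the same timeout argument). The names build_request and fetch and the 10-second default are assumptions for illustration, not code from html2text.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request that carries a custom User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10.0):
    """Fetch `url` and return the raw bytes; `timeout` is in seconds (assumed default)."""
    request = build_request(url, user_agent)
    # A stalled connection raises an error instead of blocking indefinitely.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

The returned bytes still need decoding before being handed to html2text.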

@aaronsw If I added a UA option, would you be willing to merge it into the master branch? ^^

aaronsw / html2text

Adding user agent for input url #46