ctwardy closed this issue 7 years ago
Craigslist is having redirect issues: 301, 302, 301, 302, 302, then 404.
http://washingtondc.craigslist.org
*** Agent "MEMEX_PageClass_bot/0.5" failed. Retrying...
*** Agent "Mozilla/5.0" failed. Retrying...
*** Agent "Gecko/1.0" failed. Retrying...
*** Agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0" failed. Retrying...
*** Agent "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like GeckoMozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" failed. Retrying...
ERROR : 404
COOKIES: []
HISTORY: [<Response [301]>]
HEADERS: {'Pragma': 'no-cache', 'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'Strict-Transport-Security': 'max-age=86400; includeSubDomains', 'Content-Length': '1839', 'Date': 'Fri, 03 Mar 2017 21:44:33 GMT', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Last-Modified': 'Fri, 03 Mar 2017 21:44:33 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Cache-control': 'private', 'Server': 'Apache', 'X-Frame-Options': 'SAMEORIGIN'}
RESPONSE: <!DOCTYPE html>
<html class="no-js">
<head>
<title>craigslist | Page Not Found</title>
...
---> ERROR <---
Amazon was having issues, but seems to accept the full User-Agent strings and a sleep between attempts. Also note that the rules say "news" even though the highest cosine score is for shopping (sh: 0.73). 👍
http://amazon.com
*** Agent "MEMEX_PageClass_bot/0.5" failed. Retrying...
*** Agent "Mozilla/5.0" failed. Retrying...
[fo: 0.18, ne: 0.51, cl: 0.54, sh: 0.73]
---> news <---
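To make the mismatch above concrete, here is a minimal sketch of a score-based decision, offered only as one possible alternative to the current rules (which are not shown in this thread). The function name `pick_label` and the `threshold` and `margin` values are hypothetical; the score keys are the abbreviations printed in the log.

```python
def pick_label(scores, threshold=0.5, margin=0.1):
    """Return the top-scoring label, or "undecided" when the top score is
    weak or too close to the runner-up.

    threshold and margin are illustrative values, not taken from the script.
    Expects at least two labels in scores.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score < threshold or best_score - second_score < margin:
        return "undecided"
    return best


# Scores copied from the amazon.com run above.
amazon = {"fo": 0.18, "ne": 0.51, "cl": 0.54, "sh": 0.73}
print(pick_label(amazon))
```

Under these assumed cutoffs the Amazon scores would come out as "sh" rather than "news", while an all-zero result stays "undecided".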
SoteraDefense accepts the first User-Agent tried after "Gecko/1.0".
http://soteradefense.com
*** Agent "MEMEX_PageClass_bot/0.5" failed. Retrying...
*** Agent "Mozilla/5.0" failed. Retrying...
*** Agent "Gecko/1.0" failed. Retrying...
[fo: 0.00, ne: 0.00, cl: 0.00, sh: 0.00]
---> undecided <---
However, as all scores are 0, I wonder what HTML was returned.
Google takes all comers, and looks kind of like classified or shopping. Interesting:
http://google.com
[fo: 0.00, ne: 0.00, cl: 0.40, sh: 0.33]
---> undecided <---
Strangely, Craigslist works fine when I call requests.get() from an IPython notebook on the same machine.
What's different about the script? The User-Agent?
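One way to answer that question is to compare what each environment actually sent and which redirects it followed. The sketch below assumes the response object comes from `requests`, which exposes the redirect chain as `response.history` and the outgoing headers as `response.request.headers`. The helper names `redirect_trail` and `sent_user_agent` are made up for illustration. When no User-Agent header is set, `requests` sends `python-requests/<version>` by default.

```python
def redirect_trail(response):
    """Return [(status_code, url), ...] for every redirect hop plus the final response."""
    hops = list(response.history) + [response]
    return [(r.status_code, r.url) for r in hops]


def sent_user_agent(response):
    """Return the User-Agent header that was sent for the final request."""
    return response.request.headers.get("User-Agent")


# Usage, run once in the notebook and once in the script, then compare:
#   import requests
#   r = requests.get("http://washingtondc.craigslist.org")
#   print(redirect_trail(r))
#   print(sent_user_agent(r))
```

If the two trails diverge at the same hop, the difference is more likely in the request headers or cookies than in the network path.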
Closing this as #wontfix. Clever crawling is a job for SiteHound or such, not this program.
Some sites don't like custom user agents. If we get a 404, fake a Mozilla User-Agent or something similar and retry.
Update: Only Craigslist is failing now.
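The retry idea described here can be sketched roughly as follows. This is not the script's actual code: `fetch_with_fallback`, `default_get`, the shortened `AGENTS` list, the 1-second pause, and the 10-second timeout are all assumptions for illustration. The `get` and `sleep` parameters are injectable so the loop can be exercised without touching the network.

```python
import time

# Illustrative fallback order: custom bot first, then browser-like strings.
AGENTS = [
    "MEMEX_PageClass_bot/0.5",
    "Mozilla/5.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
]


def default_get(url, user_agent):
    """Fetch url with the given User-Agent via requests (imported lazily)."""
    import requests
    return requests.get(url, headers={"User-Agent": user_agent}, timeout=10)


def fetch_with_fallback(url, agents=AGENTS, get=default_get, pause=1.0, sleep=time.sleep):
    """Try each User-Agent in turn, sleeping between attempts.

    Returns (agent, response) for the first reply with a status below 400,
    or (None, last_response) if every agent fails.
    """
    response = None
    for attempt, agent in enumerate(agents):
        if attempt:
            sleep(pause)  # back off before trying the next agent
        response = get(url, agent)
        if response.status_code < 400:
            return agent, response
        print('*** Agent "%s" failed. Retrying...' % agent)
    return None, response
```

Returning the agent that succeeded makes it easy to log which sites reject the custom bot string.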