Sotera / webpageclassifier

Categorizes a website, given its URL, as one of blog|wiki|news|forum|classified|shopping|undecided.
Apache License 2.0

ERROR on craigslist.com #4

Closed ctwardy closed 7 years ago

ctwardy commented 7 years ago

Some sites don't like custom user agents. If we get a 404, fake a Mozilla User-Agent or something similar and retry.

Update: Only Craigslist is failing now.
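
A minimal sketch of that retry idea, assuming the `requests` library does the fetching. The function name `fetch_with_fallback`, the `get` and `pause` parameters, and the shortened agent list are illustrative assumptions, not the project's actual code:

```python
import time

# Illustrative list: the custom bot name first, then more browser-like strings.
USER_AGENTS = [
    "MEMEX_PageClass_bot/0.5",
    "Mozilla/5.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
]


def fetch_with_fallback(url, agents=USER_AGENTS, get=None, pause=1.0):
    """Try each User-Agent in turn; return the first 200 response, else None."""
    if get is None:
        import requests  # imported lazily so a stub fetcher can be passed in instead
        get = requests.get
    for attempt, agent in enumerate(agents):
        if attempt > 0:
            time.sleep(pause)  # brief pause between retries
        try:
            response = get(url, headers={"User-Agent": agent}, timeout=10)
        except OSError:
            # requests' exceptions derive from IOError/OSError
            response = None
        if response is not None and response.status_code == 200:
            return response
        print('*** Agent "%s" failed. Retrying...' % agent)
    return None
```

Calling `fetch_with_fallback("http://amazon.com")` would walk the list until one agent gets a 200, and return `None` if every agent fails.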

ctwardy commented 7 years ago

Craigslist is having redirect issues: 301, 302, 301, 302, 302, then 404.

http://washingtondc.craigslist.org
*** Agent "MEMEX_PageClass_bot/0.5" failed. Retrying...
*** Agent "Mozilla/5.0" failed. Retrying...
*** Agent "Gecko/1.0" failed. Retrying...
*** Agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0" failed. Retrying...
*** Agent "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like GeckoMozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" failed. Retrying...
    ERROR  : 404
    COOKIES: []
    HISTORY: [<Response [301]>]
    HEADERS: {'Pragma': 'no-cache', 'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'Strict-Transport-Security': 'max-age=86400; includeSubDomains', 'Content-Length': '1839', 'Date': 'Fri, 03 Mar 2017 21:44:33 GMT', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Last-Modified': 'Fri, 03 Mar 2017 21:44:33 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Cache-control': 'private', 'Server': 'Apache', 'X-Frame-Options': 'SAMEORIGIN'}
    RESPONSE: <!DOCTYPE html>
<html class="no-js">
    <head>
        <title>craigslist | Page Not Found</title>
  ...
    ---> ERROR <---

ctwardy commented 7 years ago

Amazon was having issues, but it seems to like the fuller User-Agent strings plus a sleep between retries. Also, note that the rules say "news", but the highest cosine score is for shopping. 👍

http://amazon.com
*** Agent "MEMEX_PageClass_bot/0.5" failed. Retrying...
*** Agent "Mozilla/5.0" failed. Retrying...
    [fo: 0.18, ne: 0.51, cl: 0.54, sh: 0.73]
    ---> news <---

ctwardy commented 7 years ago

SoteraDefense accepts the User-Agent that comes after "Gecko/1.0" in the retry list.

http://soteradefense.com
*** Agent "MEMEX_PageClass_bot/0.5" failed. Retrying...
*** Agent "Mozilla/5.0" failed. Retrying...
*** Agent "Gecko/1.0" failed. Retrying...
    [fo: 0.00, ne: 0.00, cl: 0.00, sh: 0.00]
    ---> undecided <--- 

However, since all scores are 0, I wonder what HTML was actually returned.
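
One way to catch this case automatically is to flag an all-zero score vector so the fetched HTML can be inspected. The sketch below is hypothetical: the helper names, the dictionary layout (keys copied from the log output), and the 0.5 cutoff are assumptions, not the classifier's actual logic:

```python
def all_zero(scores):
    """True when every category score is 0, which suggests an empty or blocked page."""
    return all(value == 0 for value in scores.values())


def top_category(scores, min_score=0.5):
    """Return the highest-scoring label, or 'undecided' below the assumed cutoff."""
    if not scores or all_zero(scores):
        return "undecided"
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= min_score else "undecided"


# Scores copied from the log output in this thread.
amazon = {"fo": 0.18, "ne": 0.51, "cl": 0.54, "sh": 0.73}
sotera = {"fo": 0.00, "ne": 0.00, "cl": 0.00, "sh": 0.00}

print(top_category(amazon), all_zero(amazon))
print(top_category(sotera), all_zero(sotera))
```

When `all_zero` returns true, dumping the first few hundred characters of the response body would show whether the server sent real content or an error page.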

ctwardy commented 7 years ago

Google takes all comers, and looks kind of like classified or shopping. Interesting:

http://google.com
    [fo: 0.00, ne: 0.00, cl: 0.40, sh: 0.33]
    ---> undecided <---

ctwardy commented 7 years ago

Strangely, Craigslist works fine when I call requests.get() from an IPython notebook on the same machine.

What's different about the script? User agent?
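
A quick way to compare the two environments is to summarize what was actually sent and how the server redirected it. This is only a sketch: `describe_response` is a hypothetical helper, and it assumes a `requests` Response object (it relies on `status_code`, `url`, `history`, and `request.headers`):

```python
def describe_response(response):
    """Summarize the request headers sent and the redirect chain followed."""
    return {
        "status": response.status_code,
        "final_url": response.url,
        # Each entry in history is an intermediate redirect response.
        "redirects": [
            (hop.status_code, hop.headers.get("Location"))
            for hop in response.history
        ],
        # The User-Agent that was really sent, including requests' default.
        "sent_user_agent": response.request.headers.get("User-Agent"),
    }
```

Printing `describe_response(requests.get(url))` from both the script and the notebook and diffing the two results should show whether the User-Agent or the redirect chain is what differs.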

ctwardy commented 7 years ago

Closing this as #wontfix. Clever crawling is a job for SiteHound or such, not this program.