CIRCL / AIL-framework

AIL framework - Analysis Information Leak framework. Project moved to https://github.com/ail-project
https://github.com/ail-project/ail-framework
GNU Affero General Public License v3.0

When crawling, all domains appear to be DOWN #490

sunil3590 opened this issue 4 years ago (status: Open)

sunil3590 commented 4 years ago

ISSUE I tried to crawl a regular domain (not .onion) and the status of the domain comes up as DOWN. I've tried this with multiple domains, and even .onion domains, but the result is the same: all domains are DOWN.

SETUP I have AIL, Tor, and Splash installed and running on a single machine, with one Docker instance of Splash listening on port 8050 and Tor on port 9050:

tcp        0      0 127.0.0.1:9050          0.0.0.0:*               LISTEN      18298/tor           
tcp6       0      0 :::8050                 :::*                    LISTEN      22611/docker-proxy 
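As a first sanity check, both listeners can be probed from the AIL host. A minimal sketch, assuming the ports from the netstat output above. Note that netstat shows Tor bound to 127.0.0.1 only, which the Splash Docker container cannot reach at that address; from inside the container the host is typically 172.17.0.1 on the default bridge, a common reason onion crawls fail while both services look healthy:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Ports taken from the netstat output above; adjust if yours differ.
    for name, port in [("Tor SOCKS", 9050), ("Splash", 8050)]:
        state = "reachable" if port_open("127.0.0.1", port) else "NOT reachable"
        print(f"{name} on 127.0.0.1:{port}: {state}")
```

If Splash reports reachable here but onion crawls still fail, the next step is testing Splash-to-Tor connectivity from inside the container.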

Logs from Splash Docker

2020-04-10 08:56:20.300419 [-] "X.X.X.X" - - [10/Apr/2020:08:56:19 +0000] "GET / HTTP/1.1" 200 7679 "-" "python-requests/2.22.0"
2020-04-10 08:56:20.859058 [render] [140342956635136] loadFinished: unknown error
2020-04-10 08:56:20.860248 [events] {"path": "/execute", "rendertime": 0.007615327835083008, "maxrss": 176844, "load": [0.05, 0.19, 0.18], "fds": 60, "active": 0, "qsize": 0, "_id": 140342956635136, "method": "POST", "timestamp": 1586508980, "user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0", "args": {"cookies": [], "headers": {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en", "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0"}, "lua_source": "\nfunction main(splash, args)\n    -- Default values\n    splash.js_enabled = true\n    splash.private_mode_enabled = true\n    splash.images_enabled = true\n    splash.webgl_enabled = true\n    splash.media_source_enabled = true\n\n    -- Force enable things\n    splash.plugins_enabled = true\n    splash.request_body_enabled = true\n    splash.response_body_enabled = true\n\n    splash.indexeddb_enabled = true\n    splash.html5_media_enabled = true\n    splash.http2_enabled = true\n\n    -- User defined\n    splash.resource_timeout = args.resource_timeout\n    splash.timeout = args.timeout\n\n    -- Allow to pass cookies\n    splash:init_cookies(args.cookies)\n\n    -- Run\n    ok, reason = splash:go{args.url}\n    if not ok and not reason:find(\"http\") then\n        return {\n            error = reason,\n            last_url = splash:url()\n        }\n    end\n    if reason == \"http504\" then\n        splash:set_result_status_code(504)\n        return ''\n    end\n\n    splash:wait{args.wait}\n    -- Page instrumentation\n    -- splash.scroll_position = {y=1000}\n    splash:wait{args.wait}\n    -- Response\n    return {\n        har = splash:har(),\n        html = splash:html(),\n        png = splash:png{render_all=true},\n        cookies = splash:get_cookies(),\n        last_url = splash:url()\n    }\nend\n", "resource_timeout": 30, "timeout": 30, "url": "http://somedomain.onion", "wait": 10, "uid": 
140342956635136}, "status_code": 200, "client_ip": "172.17.0.1"}
2020-04-10 08:56:20.860431 [-] "172.17.0.1" - - [10/Apr/2020:08:56:19 +0000] "POST /execute HTTP/1.1" 200 68 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0"

The line of code in Splash that generates the error message above: https://github.com/scrapinghub/splash/blob/9fda128b8485dd5f67eb103cd30df8f325a90bb0/splash/engines/webkit/browser_tab.py#L446
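To isolate whether the failure is Splash-to-Tor connectivity rather than AIL itself, Splash can be queried directly through its render.html endpoint with an explicit socks5 proxy argument. A sketch only: the onion address is a placeholder, and 172.17.0.1 assumes the default Docker bridge (it matches the client_ip shown in the Splash logs above):

```python
from urllib import parse, request, error

SPLASH = "http://127.0.0.1:8050"  # Splash instance from the setup above

def render_url(url, proxy="socks5://172.17.0.1:9050", wait=5, timeout=30):
    """Build a Splash /render.html request URL that fetches `url` through Tor.

    The proxy host must be reachable from *inside* the Splash container,
    hence the docker0 bridge address rather than 127.0.0.1.
    """
    qs = parse.urlencode(
        {"url": url, "proxy": proxy, "wait": wait, "timeout": timeout}
    )
    return f"{SPLASH}/render.html?{qs}"

if __name__ == "__main__":
    # Placeholder onion address, for illustration only.
    target = render_url("http://somedomain.onion/")
    try:
        with request.urlopen(target, timeout=60) as resp:
            print(resp.status, resp.read(200))
    except error.URLError as exc:  # Splash unreachable, or the render failed
        print("request failed:", exc)
```

If this call succeeds but AIL still marks the domain as DOWN, the problem is likely in the AIL crawler rather than the Splash/Tor chain.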

GaganBhat commented 3 years ago

@sunil3590 Were you able to fix this? I'm experiencing the same issue: Splash is down and all domains are down.

GaganBhat commented 3 years ago

@Terrtia I'm having a similar issue with Tor links: I get a "SPLASH DOWN" error, but only with onion links. (screenshot)

The regular crawler, however, works. (screenshot)

TheFausap commented 3 years ago

Hello, I have the same issue. Is there any update? Thanks.

TheFausap commented 3 years ago

I may have found the error in the screen logs (screen -r Crawlers_AIL):

 File "/opt/AIL/bin/torcrawler/TorSplashCrawler.py", line 181, in parse
    error_retry = request.meta.get('error_retry', 0)
NameError: name 'request' is not defined
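A hedged guess at the fix, not the maintainers' patch: in a Scrapy callback only `response` is in scope, and `response.meta` is a shortcut for `response.request.meta`, so the bare name `request` on line 181 is what raises the NameError. The sketch below demonstrates the corrected lookup with a minimal stand-in object:

```python
# Sketch of the likely one-line fix in TorSplashCrawler.parse (line 181).
# The original line referenced the undefined name `request`; inside a
# Scrapy callback the retry counter lives on `response.meta`.

class FakeResponse:
    """Minimal stand-in for scrapy.http.Response, for illustration only."""
    def __init__(self, meta):
        self.meta = meta

def get_error_retry(response):
    # was: error_retry = request.meta.get('error_retry', 0)  -> NameError
    return response.meta.get('error_retry', 0)

print(get_error_retry(FakeResponse({})))                   # -> 0
print(get_error_retry(FakeResponse({"error_retry": 2})))   # -> 2
```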

matriceria commented 2 years ago

@TheFausap @Terrtia did you find a fix for this? I also can't crawl any onion domain; they all appear to be DOWN.