duckduckgo / tracker-radar-collector

🕸 Modular, multithreaded, puppeteer-based crawler

`page.setViewport` causes the browser to disconnect in mobile emulation #43

Closed by gunesacar 2 years ago

gunesacar commented 3 years ago

The -m, --mobile option seems to cause tracker-radar-collector to fail during page load. Running

$ npm run crawl -- -u "https://duck.com" -o /tmp/ -v -f -d "requests" --mobile

gives me:

Start time: Wed, 10 Mar 2021 10:58:18 GMT
Number of urls to crawl: 1
Number of crawlers: 1

Processing entry #1 (https://duck.com).
duck.com: requests init took 0.000s
duck.com: page context initiated in 0.002s
duck.com: Crawl failed net::ERR_ABORTED at https://duck.com/ Error: net::ERR_ABORTED at https://duck.com/
    at navigate (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async FrameManager.navigateFrame (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
    at async Frame.goto (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
    at async Page.goto (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:789:16)
    at async getSiteData (tracker-radar-collector/crawler.js:184:9)
duck.com: ⚠️ unmatched failed response [object Object]
duck.com: requests init took 0.000s
duck.com: page context initiated in 0.001s
duck.com: Crawl failed net::ERR_ABORTED at https://duck.com/ Error: net::ERR_ABORTED at https://duck.com/
    at navigate (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async FrameManager.navigateFrame (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
    at async Frame.goto (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
    at async Page.goto (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:789:16)
    at async getSiteData (tracker-radar-collector/crawler.js:184:9)
duck.com: ⚠️ unmatched failed response [object Object]
Max number of retries (2) exceeded for "https://duck.com".

✅ Finished successfully.
Finish time: Wed, 10 Mar 2021 10:58:18 GMT
Sucessful crawls: 0/1 (0.00%)
Failed crawls: 1/1 (100.00%)

The same crawl without the --mobile option runs just fine:

$ npm run crawl -- -u "https://duck.com" -o /tmp/ -v -f -d "requests"

Start time: Wed, 10 Mar 2021 10:58:38 GMT
Number of urls to crawl: 1
Number of crawlers: 1

Processing entry #1 (https://duck.com).
duck.com: requests init took 0.000s
duck.com: page context initiated in 0.001s
duck.com: getting requests data took 0.032s
Processing "https://duck.com" took 5.047s.

✅ Finished successfully.
Finish time: Wed, 10 Mar 2021 10:58:43 GMT
Sucessful crawls: 1/1 (100.00%)
Failed crawls: 0/1 (0.00%)

@asumansenol and I could reliably reproduce this error on a few different machines using the latest code from the main branch.

In a parallel crawl (e.g. with c=4), the same error is accompanied by a log message about the browser being disconnected:

Processing entry #3 (http://youtube.com).
facebook.com: page context initiated in 0.010s
facebook.com: Crawl failed net::ERR_ABORTED at http://facebook.com/ Error: net::ERR_ABORTED at http://facebook.com/
    at navigate (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async FrameManager.navigateFrame (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
    at async Frame.goto (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
    at async Page.goto (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:789:16)
    at async getSiteData (tracker-radar-collector/crawler.js:184:9)
facebook.com: ⚠️ unmatched failed response {
  requestId: 'C127D1C33C7C90FDB931CAFFCCAB1F85',
  timestamp: 10151.696633,
  type: 'Document',
  errorText: 'net::ERR_ABORTED',
  canceled: true
}
(node:19010) UnhandledPromiseRejectionWarning: Error: Navigation failed because browser has disconnected!
    at tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/LifecycleWatcher.js:51:147
    at tracker-radar-collector/node_modules/puppeteer/lib/cjs/vendor/mitt/src/index.js:51:62
    at Array.map (<anonymous>)
    at Object.emit (tracker-radar-collector/node_modules/puppeteer/lib/cjs/vendor/mitt/src/index.js:51:43)
    at CDPSession.emit (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/EventEmitter.js:72:22)
    at CDPSession._onClosed (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:247:14)
    at Connection._onMessage (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/common/Connection.js:94:25)
    at WebSocket.<anonymous> (tracker-radar-collector/node_modules/puppeteer/lib/cjs/puppeteer/node/NodeWebSocketTransport.js:13:32)
    at WebSocket.onMessage (tracker-radar-collector/node_modules/ws/lib/event-target.js:132:16)
    at WebSocket.emit (events.js:315:20)
(node:19010) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 4)

Since the page.setViewport call is one of the main differences between the desktop and the mobile crawl, I commented that line out and reran the mobile crawl. I didn't get any errors!

Let me know if you need any other information from me to help you solve the issue.

gunesacar commented 3 years ago

Another data point: the problem goes away if I comment out isMobile and hasTouch from MOBILE_VIEWPORT, while keeping the page.setViewport call.
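To narrow this down further, the two flags can be toggled one at a time. The following is only a minimal sketch: the viewport values are illustrative rather than copied from the repository, and `withoutFlags` is a hypothetical helper that is not part of tracker-radar-collector.

```javascript
// Illustrative values only; the real MOBILE_VIEWPORT in the crawler may differ.
const MOBILE_VIEWPORT = {
    width: 412,
    height: 691,
    deviceScaleFactor: 2,
    isMobile: true,
    hasTouch: true
};

/**
 * Hypothetical helper: returns a copy of the viewport with the given keys
 * removed, so that each flag can be tested in isolation.
 */
function withoutFlags(viewport, flags) {
    const copy = {...viewport};
    for (const flag of flags) {
        delete copy[flag];
    }
    return copy;
}

// Variants to pass to page.setViewport(), one crawl per variant.
const variants = {
    noMobile: withoutFlags(MOBILE_VIEWPORT, ['isMobile']),
    noTouch: withoutFlags(MOBILE_VIEWPORT, ['hasTouch']),
    neither: withoutFlags(MOBILE_VIEWPORT, ['isMobile', 'hasTouch'])
};
```

If only one of the first two variants fails, that would point to a single flag rather than the combination. The `neither` variant corresponds to the case described above, which crawled without errors.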

kdzwinel commented 3 years ago

Hey Gunes, thanks for the report! I gave it a quick look and it seems like an upstream issue (chromium/puppeteer) to me. The workaround is to set all mobile options on browser launch:

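function openBrowser(log, proxyHost) {
    const args = {
        // Workaround: apply the mobile viewport at browser launch instead of per page
        defaultViewport: MOBILE_VIEWPORT
    };
    // ... (rest of openBrowser unchanged)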

and comment out

    // page.setViewport(emulateMobile ? MOBILE_VIEWPORT : DEFAULT_VIEWPORT);

I'll land a proper fix at some point, but please use the workaround for now.
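For anyone who needs both desktop and mobile crawls while the workaround is in place, the viewport can be chosen at launch time. This is a rough sketch and not the project's actual code: `buildLaunchOptions` and `openPage` are hypothetical names, and the viewport values are placeholders. It relies only on puppeteer's `defaultViewport` launch option, which new pages inherit.

```javascript
// Placeholder viewport values; the project's real constants may differ.
const DEFAULT_VIEWPORT = {width: 1440, height: 812};
const MOBILE_VIEWPORT = {
    width: 412,
    height: 691,
    deviceScaleFactor: 2,
    isMobile: true,
    hasTouch: true
};

/**
 * Hypothetical helper: builds puppeteer launch options so that the viewport
 * (including isMobile/hasTouch) is applied when the browser starts.
 */
function buildLaunchOptions(emulateMobile) {
    return {
        defaultViewport: emulateMobile ? MOBILE_VIEWPORT : DEFAULT_VIEWPORT
    };
}

/**
 * Hypothetical usage: opens a page without ever calling page.setViewport().
 */
async function openPage(url, emulateMobile) {
    // Loaded lazily so buildLaunchOptions can be used without puppeteer installed.
    const {default: puppeteer} = await import('puppeteer');
    const browser = await puppeteer.launch(buildLaunchOptions(emulateMobile));
    try {
        const page = await browser.newPage();
        // The page inherits defaultViewport, so no setViewport call is needed.
        await page.goto(url);
        return await page.title();
    } finally {
        await browser.close();
    }
}
```

Calling `openPage('https://duck.com', true)` would then run a mobile-emulated load. The trade-off is that the viewport is fixed per browser instance, so mixing desktop and mobile pages means launching separate browsers.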

BTW Congrats on https://arxiv.org/pdf/2102.09301.pdf , well done 👏 Please feel free to reach out to me directly (konrad at duckduckgo.com) if you have any thoughts about the crawler or would like to use Tracker Radar data in your research (we are crawling over 150k pages on a regular basis and can adjust the crawler to collect more data if needed).

gunesacar commented 3 years ago

@kdzwinel Thanks so much for promptly addressing this. It makes sense that this is an upstream issue.

> BTW Congrats on https://arxiv.org/pdf/2102.09301.pdf , well done 👏

Thank you! Much of the credit goes to @ydimova. For the record, our experience using tracker-radar-collector for the study was just great. I especially appreciated how easy it is to add new instrumentation, since your method based on Runtime.evaluate is so generic (and novel). Also, the tool is super easy to get started with and was quite stable, handling tens of thousands of sites without any hiccups. I am certain that tracker-radar-collector will soon be a popular tool (along with OpenWPM) within the research community.

> Please feel free to reach out to me directly (konrad at duckduckgo.com) if you have any thoughts about the crawler or would like to use Tracker Radar data in your research (we are crawling over 150k pages on a regular basis and can adjust the crawler to collect more data if needed).

I'll be more than happy to reach out. We have other projects that are based on tracker-radar-collector and I think it'd be useful to keep a channel open.