duckduckgo / tracker-radar-collector

🕸 Modular, multithreaded, puppeteer-based crawler
Other
134 stars 49 forks source link

Unmatched responses due to new contexts and other causes #44

Closed gunesacar closed 3 years ago

gunesacar commented 3 years ago

Hi!

Going through a crawl data of Tranco top 10K sites, I found ~15-16% websites have at least one unmatched response. Almost half of these websites have some new contexts (e.g. service workers) initialized during the page visit.

Just to be clear, compared to total requests/responses captured by Tracker Radar Collector, the unmatched ones seem to be rare occurrences. Also, this may be an upstream issue with Puppeteer, but I couldn't figure out how to find out. I wanted to give a heads up anyways.

Here's the breakdown of unmatched responses based on new contexts initialized on websites:

Service workers

770 (of the ~10K) sites have at least one service worker context initiated. 660 (86%) of these have at least one unmatched response. The unmatched response seems to come immediately after the service worker context initialization. Spot checking the URLs they look like the requests to download the service worker script itself:

Reproducible on: pptr.dev, whatwg.org, coinbase.com

$ npm run crawl -- -u https://pptr.dev -o /tmp/ -v -f -d "requests" 
[...]
pptr.dev: requests init took 0.000s
pptr.dev: page context initiated in 0.002s
pptr.dev: https://pptr.dev/sw.js (service_worker) context initiated in 0.038s
pptr.dev: ⚠️ unmatched response 306390.16 https://pptr.dev/sw.js
pptr.dev: ⚠️ unmatched finished response {
  requestId: '306390.16',
  timestamp: 203383.850368,
  encodedDataLength: 0,
  shouldReportCorbBlocking: false
}
pptr.dev: ⚠️ failed to match 1 events
pptr.dev: getting requests data took 0.075s
Processing "https://pptr.dev" took 8.402s.
[...]
$ npm run crawl -- -u https://whatwg.org -o /tmp/ -v -f -d "requests" 
[...]
whatwg.org: requests init took 0.000s
whatwg.org: page context initiated in 0.001s
whatwg.org: https://whatwg.org/service-worker.js (service_worker) context initiated in 0.018s
whatwg.org: ⚠️ unmatched response 308294.12 https://whatwg.org/service-worker.js
whatwg.org: ⚠️ unmatched finished response {
  requestId: '308294.12',
  timestamp: 204695.719505,
  encodedDataLength: 0,
  shouldReportCorbBlocking: false
}
whatwg.org: ⚠️ failed to match 1 events
[...]

Web workers

81 sites have unmatched responses and (other) context initiated. For a few of these scripts I checked the devtools on a headful browser and verified that they are web worker scripts:

Reproducible on: fivethirtyeight.com, venturebeat.com, marieclaire.com, twitch.tv

$ npm run crawl -- -u fivethirtyeight.com  -o /tmp/ -v -f -d "requests"
[...]
fivethirtyeight.com: requests init took 0.000s
fivethirtyeight.com: page context initiated in 0.002s
fivethirtyeight.com: https://www.google.com/recaptcha/enterprise/webworker.js?hl=en&v=6g5J7UfDQ9mLrweZHj04ekSP (other) context initiated in 0.039s
fivethirtyeight.com: ⚠️ unmatched finished response {
  requestId: 'B527FE92D849C992C2B6AF34A730D11A',
  timestamp: 202060.395877,
  encodedDataLength: 91,
  shouldReportCorbBlocking: false
}
fivethirtyeight.com: ⚠️ failed to match 1 events
$ npm run crawl -- -u marieclaire.com -o /tmp/ -v -f -d "requests"
[...]
marieclaire.com: requests init took 0.000s
marieclaire.com: page context initiated in 0.002s
marieclaire.com: https://www.marieclaire.com/_sites/marieclaire/_next/static/chunks/logger.87a4d4503d32d0ee150d.worker.js (other) context initiated in 0.006s
marieclaire.com: ⚠️ unmatched response 304608.337 https://www.marieclaire.com/_sites/marieclaire/_next/static/chunks/logger.87a4d4503d32d0ee150d.worker.js
marieclaire.com: ⚠️ unmatched finished response {
  requestId: '304608.337',
  timestamp: 202498.410642,
  encodedDataLength: 1751,
  shouldReportCorbBlocking: false
}
marieclaire.com: ⚠️ failed to match 1 events
[...]

Popups

On a handful of sites unmatched responses seem to happen due to popup windows. To my surprise it turns out Puppeteer does not block (some?) popup windows as headful Chromium does. (Perhaps due to this open issue: https://github.com/puppeteer/puppeteer/issues/6161)

Reproducible on: naukri.com, see screenshots below:

Headful Chrome image

Tracker Radar Collector (with VISUAL_DEBUG=true) image

Popup windows are represented as (page) context in the logs:

$ npm run crawl -- -u https://naukri.com -o /tmp/ -v -f -d "requests" 

[...]
naukri.com: requests init took 0.000s
naukri.com: page context initiated in 0.002s
naukri.com: ⚠️ unmatched response 307251.188 https://company.naukri.com/popups/telus/19032021/telus-rs-250x250-19032021.gif
naukri.com: https://company.naukri.com/popups/telus/19032021/index.html (page) context initiated in 0.214s
naukri.com: https://company.naukri.com/popups/ptc/19032021/index.html (page) context initiated in 0.205s
naukri.com: https://company.naukri.com/popups/hsbc/3172020/index.html (page) context initiated in 0.200s
naukri.com: ⚠️ unmatched finished response {
  requestId: '307251.188',
  timestamp: 203887.817555,
  encodedDataLength: 42994,
  shouldReportCorbBlocking: false
}
naukri.com: ⚠️ unmatched response 307251.192 https://company.naukri.com/popups/ptc/19032021/ptc-rs-250x250-19032021.gif
naukri.com: ⚠️ unmatched finished response {
  requestId: '307251.192',
  timestamp: 203887.853638,
  encodedDataLength: 74131,
  shouldReportCorbBlocking: false
}
naukri.com: ⚠️ unmatched response 307251.196 https://company.naukri.com/popups/hsbc/3172020/hsbc-ns-250x250-2972020.gif
naukri.com: ⚠️ unmatched finished response {
  requestId: '307251.196',
  timestamp: 203887.854947,
  encodedDataLength: 7209,
  shouldReportCorbBlocking: false
}
[...]

Other (no new contexts)

345 sites has no popups or new contexts such as service_worker, shared_worker, (other), but they have unmatched responses.

Reproducible on: ca.gov, 9gag.com, bbc.co.uk

$ npm run crawl -- -u https://ca.gov -o /tmp/ -v -f -d "requests" 
[...]
Processing entry #1 (https://ca.gov).
ca.gov: requests init took 0.000s
ca.gov: page context initiated in 0.002s
ca.gov: ⚠️ unmatched finished response {
  requestId: '351DF65D65D9F65975BD0B92926C0FC9',
  timestamp: 203157.612804,
  encodedDataLength: 2687,
  shouldReportCorbBlocking: false
}
ca.gov: ⚠️ failed to match 1 events
[...]

I had the suspicion that attaching to targets may take a while for new contexts, during which some requests may be missed. But that doesn't explain why requests are missed on websites with no new contexts.

Anyhow, hope this helps. If you need any other info, or raw crawl data, just let me know.

kdzwinel commented 3 years ago

Thank you for such a detailed report!

46 fixes issues with service worker, web worker are some of the "other" issues. As you guessed, attaching takes time so I'm now forcing new contexts to wait for the debugger.

This didn't work for popups for some reason, I'll create a follow up task to look at popups separately.

In case of ca.gov there is no missing request, but rather we are seeing an "unmatched response" event for a script created dynamically by https://translate.googleapis.com/element/TE_20210224_00/e/js/element/element_main.js . I have no idea why this produces a network request. You can debug cases like that by fetching request body for them (Network.getResponseBody).

gunesacar commented 3 years ago

Thanks so much!

kdzwinel commented 3 years ago

Thanks again @gunesacar for reporting this! I added integration test that checks some of those edge cases (workers) - https://github.com/duckduckgo/tracker-radar-collector/blob/main/tests/integration/requestCollection.test.js and created a follow up to look into popups in #47 . Please let me know if I missed something.