duckduckgo / tracker-radar-collector

🕸 Modular, multithreaded, puppeteer-based crawler

Some requests made by popups are not recorded. #47

Open kdzwinel opened 3 years ago

kdzwinel commented 3 years ago

This was originally reported in #44 by @gunesacar:


On a handful of sites, unmatched responses seem to be caused by popup windows. To my surprise, Puppeteer does not block (some?) popup windows the way headful Chromium does (perhaps due to this open issue: https://github.com/puppeteer/puppeteer/issues/6161).

Reproducible on naukri.com; see the screenshots below:

[image: Headful Chrome]

[image: Tracker Radar Collector (with VISUAL_DEBUG=true)]

Popup windows show up as "(page) context" entries in the logs:

$ npm run crawl -- -u https://naukri.com -o /tmp/ -v -f -d "requests" 

[...]
naukri.com: requests init took 0.000s
naukri.com: page context initiated in 0.002s
naukri.com: ⚠️ unmatched response 307251.188 https://company.naukri.com/popups/telus/19032021/telus-rs-250x250-19032021.gif
naukri.com: https://company.naukri.com/popups/telus/19032021/index.html (page) context initiated in 0.214s
naukri.com: https://company.naukri.com/popups/ptc/19032021/index.html (page) context initiated in 0.205s
naukri.com: https://company.naukri.com/popups/hsbc/3172020/index.html (page) context initiated in 0.200s
naukri.com: ⚠️ unmatched finished response {
  requestId: '307251.188',
  timestamp: 203887.817555,
  encodedDataLength: 42994,
  shouldReportCorbBlocking: false
}
naukri.com: ⚠️ unmatched response 307251.192 https://company.naukri.com/popups/ptc/19032021/ptc-rs-250x250-19032021.gif
naukri.com: ⚠️ unmatched finished response {
  requestId: '307251.192',
  timestamp: 203887.853638,
  encodedDataLength: 74131,
  shouldReportCorbBlocking: false
}
naukri.com: ⚠️ unmatched response 307251.196 https://company.naukri.com/popups/hsbc/3172020/hsbc-ns-250x250-2972020.gif
naukri.com: ⚠️ unmatched finished response {
  requestId: '307251.196',
  timestamp: 203887.854947,
  encodedDataLength: 7209,
  shouldReportCorbBlocking: false
}
[...]
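
To illustrate how an "unmatched response" can arise, here is a minimal sketch of request/response bookkeeping keyed by the CDP requestId. This is not the collector's actual code: `RequestMatcher` and its method names are hypothetical, and only the event payload shapes (`requestId`, `request.url`, `response.url`, `response.status`) follow the CDP Network domain. If recording starts after a popup has already sent its request, the response arrives with a requestId that was never registered.

```javascript
// Hypothetical helper, for illustration only.
class RequestMatcher {
    constructor() {
        // requestId -> data collected so far for that request
        this.pending = new Map();
        // responses whose request was never observed
        this.unmatched = [];
    }

    // Would be called from a Network.requestWillBeSent handler.
    onRequestWillBeSent({requestId, request}) {
        this.pending.set(requestId, {url: request.url});
    }

    // Would be called from a Network.responseReceived handler.
    // Returns true if the response belongs to a known request.
    onResponseReceived({requestId, response}) {
        const entry = this.pending.get(requestId);
        if (!entry) {
            // The request was sent before we started listening,
            // e.g. by a popup that loaded before it was instrumented.
            this.unmatched.push({requestId, url: response.url});
            return false;
        }
        entry.status = response.status;
        return true;
    }
}
```

In this model, the three popup GIFs in the log above would each end up in `unmatched`, because their requests were issued before network recording was active for the popup.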
stevenwdv commented 2 years ago

For clarity: this also applies to websites that open new tabs, e.g. via window.open('#', '_blank') or <a href=# target=_blank>x</a>. The initial request(s?) made by the new tab will not be captured.
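
One possible direction, sketched under assumptions, is to have new targets pause on start until network recording is enabled, using the CDP methods Target.setAutoAttach (with waitForDebuggerOnStart) and Runtime.runIfWaitingForDebugger. The function name `capturePopupRequests` and its parameters `rootSession`, `resolveSession`, and `onRequest` are hypothetical. The session objects only need CDP-style `send(method, params)` and `on(event, handler)` methods. How to obtain the child session for a given sessionId, and whether `rootSession` should be a browser-level or page-level session, depends on the Puppeteer version and is left to the caller.

```javascript
// Sketch only: not the collector's implementation.
// rootSession: session on which auto-attach is configured (assumption).
// resolveSession: hypothetical callback mapping a sessionId to a session object.
// onRequest: hypothetical callback receiving (targetInfo, requestWillBeSentEvent).
async function capturePopupRequests(rootSession, resolveSession, onRequest) {
    rootSession.on('Target.attachedToTarget', async ({sessionId, targetInfo, waitingForDebugger}) => {
        const child = resolveSession(sessionId);

        // Subscribe first, then enable the Network domain for the new target.
        child.on('Network.requestWillBeSent', event => onRequest(targetInfo, event));
        await child.send('Network.enable');

        if (waitingForDebugger) {
            // The target was paused on start; let it begin loading only now,
            // so its initial request is observed.
            await child.send('Runtime.runIfWaitingForDebugger');
        }
    });

    // Register the handler above before turning auto-attach on.
    await rootSession.send('Target.setAutoAttach', {
        autoAttach: true,
        waitForDebuggerOnStart: true,
        flatten: true,
    });
}
```

The ordering is the point of the sketch: the new tab or popup is held until Network.enable has completed, and only then resumed. Whether this covers every popup case reported here would need to be verified against naukri.com.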