duckduckgo / tracker-radar-collector

🕸 Modular, multithreaded, puppeteer-based crawler

Early browser API accesses and function calls are missed #77

Open asumansenol opened 1 year ago

asumansenol commented 1 year ago

Hi! While running some pilot crawls for our current study, we found that the TRC doesn’t collect function calls or property accesses when the call/access occurs immediately after page load. Perhaps APICallCollector doesn’t have enough time to register its breakpoints. To test this issue, we have created two test pages that:

  1. Access window.devicePixelRatio
  2. Call toDataURL method of an HTML5 canvas element
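For illustration, a probe script on such a test page might look roughly like the sketch below. It is not the actual test page code: the function names, the `win`/`doc` parameters (stand-ins for `window` and `document`), and the `delayMs` parameter are all assumptions made for this example.

```javascript
// Hypothetical probe script for a test page; all names here are illustrative.
// `win` and `doc` stand in for the browser's `window` and `document`.
function runProbes(win, doc) {
  // 1. Property access that an API-call collector is expected to record.
  const ratio = win.devicePixelRatio;
  // 2. Method call on an HTML5 canvas element.
  const canvas = doc.createElement('canvas');
  const dataUrl = canvas.toDataURL();
  return { ratio, dataUrl };
}

// delayMs <= 0 runs the probes synchronously, i.e. as early as the script can;
// a positive value defers them with a timer (the delay is an assumed parameter).
function scheduleProbes(win, doc, delayMs, onDone) {
  if (delayMs <= 0) {
    onDone(runProbes(win, doc));
    return;
  }
  win.setTimeout(() => onDone(runProbes(win, doc)), delayMs);
}

// In a browser this could be invoked as:
//   scheduleProbes(window, document, 0, console.log);
```

Running the same probes with and without a delay is one way to separate "the collector never sees this API" from "the collector was not ready yet".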

We’ve visited the test pages using the latest version of TRC without any modification.

  1. Test page 1: The script is run 1000ms after the page load.

I hope this helps. If you need any other info, just let me know.

kdzwinel commented 1 year ago

Hey @asumansenol , thanks for bringing this up!

I observed the same with our API collection integration test (https://github.com/duckduckgo/tracker-radar-collector/blob/main/tests/integration/apiCollection.test.js), which is somewhat flaky because of this issue.

I suspect a race condition between the API collection script setting things up (https://github.com/duckduckgo/tracker-radar-collector/blob/main/collectors/APICalls/TrackerTracker.js#L126) and scripts on the page that are already running.
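To make the suspected race concrete, here is a minimal, browser-free simulation. It is only an analogy: it uses a getter installed with `Object.defineProperty` as a stand-in for whatever instrumentation the collector sets up, which is not how TrackerTracker.js is implemented, and every name in it is hypothetical.

```javascript
// Stand-in for a page's global object.
function makeFakeWindow() {
  return { devicePixelRatio: 2 };
}

// Stand-in for "collector setup": wrap a property so reads get logged.
function instrument(target, prop, log) {
  let value = target[prop];
  Object.defineProperty(target, prop, {
    configurable: true,
    get() {
      log.push(prop);
      return value;
    },
    set(v) {
      value = v;
    },
  });
}

// Replay the steps in the given order and return what was recorded.
function run(order) {
  const win = makeFakeWindow();
  const log = [];
  for (const step of order) {
    if (step === 'instrument') instrument(win, 'devicePixelRatio', log);
    if (step === 'access') void win.devicePixelRatio;
  }
  return log;
}

const pageScriptFirst = run(['access', 'instrument']); // page wins the race
const collectorFirst = run(['instrument', 'access']); // collector wins the race
console.log(pageScriptFirst.length, collectorFirst.length); // prints "0 1"
```

The access itself is identical in both runs; only the ordering decides whether it is recorded, which would be consistent with the flaky integration test.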

This is not a huge issue for the DDG use case: in most cases everything is ready before third-party requests load and execute, and we operate on a huge sample of sites. But I can see how this is not precise enough for other use cases.

I suspect this is fixable - I'll give it a shot next week and let you know.

kdzwinel commented 1 year ago

Sorry, still no solution to this. @muodov is updating APICollector for better attribution (https://github.com/duckduckgo/tracker-radar-collector/pull/90), but it doesn't seem to have an effect on this issue. I suspect the solution here is to block scripts from running before all collectors are fully set up. This can be done e.g. via Debugger.pause as soon as the page starts loading.
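A rough sketch of that idea, under several assumptions: `cdp` is any object exposing `send(method, params)` and `once(event, handler)` (a puppeteer `CDPSession` has both), and `setUpCollectors` is a hypothetical async callback standing in for collector initialization. This is not code from TRC, it has not been verified against a real Chromium instance, and it ignores how it would interact with collectors that handle `Debugger.paused` themselves.

```javascript
// Sketch only: hold page scripts at a debugger pause until setup has finished.
async function holdScriptsUntilReady(cdp, setUpCollectors) {
  await cdp.send('Debugger.enable');

  // Start collector setup without awaiting it, so the pause is requested right away.
  const ready = Promise.resolve().then(() => setUpCollectors(cdp));
  // Avoid an unhandled-rejection warning; errors still surface via `await ready` below.
  ready.catch(() => {});

  // Once the page actually pauses, wait for setup and only then let it continue.
  cdp.once('Debugger.paused', async () => {
    await ready.catch(() => {});
    // Resuming can fail if the target is gone; that is ignored in this sketch.
    await cdp.send('Debugger.resume').catch(() => {});
  });

  // Debugger.pause stops execution on the next JavaScript statement.
  await cdp.send('Debugger.pause');
  await ready;
}
```

The resume is issued from the `Debugger.paused` handler rather than directly after setup because `Debugger.pause` only schedules a pause; if no script has run yet, there is nothing to resume at that point.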

muodov commented 1 year ago

There seems to be a problem with RequestCollector and the latest Chromium as well. I'm currently investigating, but I don't have a concrete solution yet.

muodov commented 1 year ago

I think this is basically the same problem as described in https://github.com/puppeteer/puppeteer/issues/8507. This was fixed in puppeteer last year, but unfortunately it is incompatible with our current CDP usage, as I mentioned in https://github.com/duckduckgo/tracker-radar-collector/issues/84#issuecomment-1452230159. We're exploring different options to fix this at the moment.