Closed eloquence closed 5 years ago
We mentioned in the meeting the strict approach of looking at loaded URLs and dinging the landing page for any cross-domain request. It also occurs to me that we could potentially lean on Privacy Badger's list if we wanted to limit our checks to known trackers.
... though looking through the cookie block list, I'm not sure I correctly understand what it is a list of. There are a number of domains in there I wouldn't expect.
As far as I can tell, it doesn't really matter from our point of view whether the third-party asset is intended for tracking or not. The problem is that it increases the vulnerability surface for third parties to learn about source behavior. True, analytics scripts are designed to be intrusive and collect as much data as possible, but the mere presence of logs on third-party servers is problematic.
Makes sense to me
OK, so the implementation options I see for this are:
naive check consistent with current approach: parse source attributes for scripts, images, etc. to detect resources apparently served from a different domain than the landing page itself. Advantage: fairly straightforward and probably doable as part of initial launch. Disadvantage: could miss some cases where resources are loaded via scripts or edge cases in the source HTML.
shift to different loading approach within the scanner, e.g., use of Puppeteer or other headless browser automation that allows us to get a full log of all resources. Advantage: likely to catch most/all cases. Disadvantage: increases test time and possibly fragility as well; greater implementation complexity and long-term maintenance burden.
My suggestion would be to add an experimental naive check for now and see how well it works for real-world use cases. The combination of the existing analytics checks + additional HTML parsing may get us sufficiently close -- even when resources are loaded via JS, that is often done via a <script> tag that we can detect in the page source. If this is too fragile, then I would recommend deferring this check until after the initial launch.
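For illustration, the naive check could look roughly like the sketch below. It is not the scanner's actual code: the names `RESOURCE_ATTRS`, `AssetCollector`, and `find_cross_domain_assets` are hypothetical, the tag list is an illustrative subset, and the comparison is a plain hostname match that does not yet handle subdomains. It only sees what is in the HTML returned by the server, so anything loaded later via JS is missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical, illustrative subset of tag -> attribute pairs that make a
# browser fetch a resource. A real check would likely need more entries.
RESOURCE_ATTRS = {
    "script": "src",
    "img": "src",
    "iframe": "src",
    "audio": "src",
    "video": "src",
    "source": "src",
    "embed": "src",
    "link": "href",
}


class AssetCollector(HTMLParser):
    """Collects absolute URLs of resources referenced in the page source."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        attr_map = dict(attrs)
        # Only count <link> elements that load a stylesheet; other rel
        # values (e.g. canonical) do not necessarily trigger a request.
        if tag == "link":
            rel = (attr_map.get("rel") or "").lower().split()
            if "stylesheet" not in rel:
                return
        value = attr_map.get(wanted)
        if value:
            # Resolve relative and protocol-relative URLs against the page.
            self.assets.append(urljoin(self.page_url, value))


def find_cross_domain_assets(page_url, html):
    """Return asset URLs whose hostname differs from the page's hostname."""
    collector = AssetCollector(page_url)
    collector.feed(html)
    page_host = urlparse(page_url).hostname
    flagged = []
    for url in collector.assets:
        parsed = urlparse(url)
        # Skip data: URIs and other schemes that do not hit the network.
        if parsed.scheme in ("http", "https") and parsed.hostname != page_host:
            flagged.append(url)
    return flagged
```

Calling `find_cross_domain_assets(landing_page_url, page_source)` returns the list of offending URLs in document order; an empty list means nothing was detected in the static HTML.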
Thoughts?
Just to capture the discussion from this morning:
[...] src attributes, inline scripts, etc.) and ensures they don't differ from the landing page's primary domain name. We should design these checks to be robust for common third-level domain cases (e.g., newssite.co.uk loading resources from analyticssite.co.uk).
[...] images.nytimes.com), it's unlikely that this would surface any new subdomain issues that our primary subdomain check doesn't already catch (e.g., innocentname.com loading assets from leakallthethings.innocentname.com), and more likely to lead to unnecessary warnings.

As discussed this morning, here are a few example landing pages with third-party assets to test with: [examples redacted]
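To make the third-level domain case concrete (newssite.co.uk vs. analyticssite.co.uk), here is a minimal sketch of comparing primary domain names rather than full hostnames. It assumes that subdomains of the landing page's own primary domain should not be flagged, which is my reading of the discussion above. `MULTI_LABEL_SUFFIXES` is a tiny hard-coded sample purely for illustration; a real check would need the full Public Suffix List. The function names are hypothetical, and both URLs are assumed to be absolute.

```python
from urllib.parse import urlparse

# Hypothetical sample only. A real implementation would need the complete
# Public Suffix List rather than this hard-coded handful of entries.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}


def registrable_domain(hostname):
    """Return the primary (registrable) domain for a hostname."""
    labels = hostname.lower().rstrip(".").split(".")
    # For suffixes like co.uk, the primary domain is the last three labels.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def is_third_party(page_url, asset_url):
    """True if the asset's primary domain differs from the page's."""
    page_domain = registrable_domain(urlparse(page_url).hostname)
    asset_domain = registrable_domain(urlparse(asset_url).hostname)
    return page_domain != asset_domain
```

With this comparison, images.nytimes.com counts as the same party as www.nytimes.com, while analyticssite.co.uk is correctly treated as distinct from newssite.co.uk even though both end in co.uk.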
Example output for the first site [example redacted]
The information is structured like a plain text tree, with top-level pages/files and assets within those files. The "normal" lines are the URLs of a page, a script, or a CSS file, and the indented lines with * bullets are assets contained in or requested by that asset.
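As a rough illustration of that layout (not the actual tool's code), a formatter along these lines would produce it. The function name `format_asset_tree`, the dict-based input, and the two-space indent are all assumptions.

```python
def format_asset_tree(tree):
    """Render {parent_url: [asset_url, ...]} as a plain text tree.

    Parent URLs (pages, scripts, CSS files) go on unindented lines; the
    assets each one contains or requests follow as indented * bullets.
    """
    lines = []
    for parent, assets in tree.items():
        lines.append(parent)
        for asset in assets:
            lines.append(f"  * {asset}")
    return "\n".join(lines)
```

Passing a dict keyed by page, script, or CSS URL preserves insertion order, so the output lists parents in the order they were discovered.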
See #557 for further explanation of what is checked for overall. [example redacted]
Check for scripts, styles, media, iframes, and other third party assets that should not be embedded in a landing page (i.e. are loaded from a different domain). Note that this initially only covers assets embedded in the HTML returned by the server, rather than ones dynamically loaded via JavaScript.
If such assets are found, it should trigger a severe warning (#496), but note the caveat above regarding QA + outreach to news orgs.
We should only do this as part of the initial integration (#488) if we can decide on an implementation approach that's not overly burdensome and fragile.