Check landing pages for assets on third party domain

freedomofpress / securedrop.org

Code for the SecureDrop project website

https://securedrop.org

GNU Affero General Public License v3.0


Check landing pages for assets on third party domain #506

Closed eloquence closed 5 years ago

eloquence commented 6 years ago

Check for scripts, styles, media, iframes, and other third party assets that should not be embedded in a landing page (i.e. are loaded from a different domain). Note that this initially only covers assets embedded in the HTML returned by the server, rather than ones dynamically loaded via JavaScript.

If such assets are found, this should trigger a severe warning (#496), but note the above caveat regarding QA + outreach to news orgs.

We should only do this as part of the initial integration (#488) if we can decide on an implementation approach that's not overly burdensome and fragile.

harrislapiroff commented 5 years ago

In the meeting we mentioned the strict approach of looking for loaded URLs and dinging the landing page for any cross-domain request. It also occurs to me that we could potentially lean on Privacy Badger's list if we wanted to limit our checks to known trackers.

harrislapiroff commented 5 years ago

... though looking through the cookie block list, I'm not sure I correctly understand what it is a list of. There are a number of domains in there I wouldn't expect.

eloquence commented 5 years ago

As far as I can tell it doesn't really matter from our point of view whether the third party asset is intended for tracking or not. The problem is that it increases the vulnerability surface for third parties to learn about source behavior. True, analytics scripts are designed to be intrusive and collect the maximum of data, but the mere presence of logs on third party servers is problematic.

harrislapiroff commented 5 years ago

Makes sense to me

eloquence commented 5 years ago

OK, so the implementation options I see for this are:

1. Naive check consistent with the current approach: parse source attributes for scripts, images, etc. to detect resources apparently served from a domain other than that of the landing page itself. Advantage: fairly straightforward and probably doable as part of the initial launch. Disadvantage: could miss some cases where resources are loaded via scripts, or edge cases in the source HTML.
2. Shift to a different loading approach within the scanner, e.g., use of Puppeteer or other headless browser automation that allows us to get a full log of all resources. Advantage: likely to catch most/all cases. Disadvantage: increases test time and possibly fragility as well; greater implementation complexity and long-term maintenance burden.

My suggestion would be to attempt to add an experimental naive check for now and see how well it works for real-world use cases. The combination of the existing analytics checks + additional HTML parsing may get us sufficiently close -- even when resources are loaded via JS, this is likely often done via a <script> tag that we can detect in the page source. If this is too fragile, then I would recommend deferring this check until after the initial launch.
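A minimal sketch of what such a naive check could look like, using only the Python standard library. The tag-to-attribute table, the class name `AssetCollector`, and the function name `cross_host_assets` are hypothetical and not taken from the scanner's code. The sketch compares full hostnames only and, as noted above, will not see resources loaded dynamically by scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hypothetical table: which attribute names the loaded resource for each tag.
ASSET_ATTRS = {
    "script": "src",
    "img": "src",
    "iframe": "src",
    "audio": "src",
    "video": "src",
    "source": "src",
    "embed": "src",
    "object": "data",
    "link": "href",
}


class AssetCollector(HTMLParser):
    """Collect absolute URLs of assets referenced in static HTML."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attr_name = ASSET_ATTRS.get(tag)
        if attr_name is None:
            return
        attrs = dict(attrs)
        if tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" not in rel:
                return  # simplification: only stylesheet links count as assets
        value = attrs.get(attr_name)
        if value:
            # Resolve relative and protocol-relative references.
            self.assets.append(urljoin(self.page_url, value))


def cross_host_assets(page_url, html):
    """Return http(s) asset URLs whose hostname differs from the page's."""
    page_host = urlsplit(page_url).hostname
    collector = AssetCollector(page_url)
    collector.feed(html)
    flagged = []
    for url in collector.assets:
        parts = urlsplit(url)
        # Ignore data: URIs and other schemes that make no network request.
        if parts.scheme in ("http", "https") and parts.hostname != page_host:
            flagged.append(url)
    return flagged
```

Calling `cross_host_assets("https://news.example.org/tips", html)` returns the list of offending URLs, which could then feed the warning described in #496.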

Thoughts?

eloquence commented 5 years ago

Just to capture the discussion from this morning:

- For now we'll go with a naive implementation that extracts apparent domain name references from various parts of the source (src attributes, inline scripts, etc.) and ensures they don't differ from the landing page's primary domain name. We should design these checks to be robust for common third-level domain cases (e.g., newssite.co.uk loading resources from analyticssite.co.uk).
- I would argue we can tolerate resources being loaded from subdomains (e.g., images.nytimes.com). Flagging them is unlikely to surface any new subdomain issues that our primary subdomain check doesn't already catch (e.g., innocentname.com loading assets from leakallthethings.innocentname.com), and is more likely to lead to unnecessary warnings.
- Longer term, some headless browser automation may be desirable. We could route requests through a proxy that would block and log bad requests, so the actual logic could be fairly simple.
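The domain comparison described above can be sketched as follows. This is an illustration only: `MULTI_LABEL_SUFFIXES` is a tiny hand-picked stand-in for the Public Suffix List, which a real implementation would need to consult, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

# Illustrative only: a handful of two-label public suffixes. A real check
# would load the full Public Suffix List instead of hardcoding these.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}


def registrable_domain(hostname):
    """Return the public suffix plus one label.

    For example, 'images.newssite.co.uk' maps to 'newssite.co.uk'.
    """
    labels = hostname.lower().rstrip(".").split(".")
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


def is_third_party(page_url, asset_url):
    """True if the asset's registrable domain differs from the page's.

    Subdomains of the page's own domain are tolerated.
    """
    page_host = urlsplit(page_url).hostname
    asset_host = urlsplit(asset_url).hostname
    if not page_host or not asset_host:
        return False  # relative or non-network URL; nothing to compare
    return registrable_domain(page_host) != registrable_domain(asset_host)
```

With this comparison, newssite.co.uk loading from analyticssite.co.uk is flagged, while www.nytimes.com loading from images.nytimes.com is not.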

eloquence commented 5 years ago

As discussed this morning, here are a few example landing pages with third party assets to test with: [examples redacted]

chigby commented 5 years ago

Example output for the first site [example redacted]

The information is structured like a plain text tree, with top-level pages/files and assets within those files. The "normal" (unindented) lines are the URLs of a page, a script, or a CSS file, and the indented lines with * bullets are assets contained in or requested by that page or file.
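As a rough illustration of that layout, the sketch below renders a mapping from each page/script/CSS URL to the assets found in it. The function name `render_asset_tree`, the input shape, and the two-space indentation are assumptions; the scanner's actual output format may differ.

```python
def render_asset_tree(assets_by_source):
    """Render {source_url: [asset_url, ...]} as a plain-text tree.

    Source URLs go on unindented lines; each asset found in or requested
    by that source goes on an indented line with a * bullet.
    """
    lines = []
    for source_url, asset_urls in assets_by_source.items():
        lines.append(source_url)
        for asset_url in asset_urls:
            lines.append(f"  * {asset_url}")
    return "\n".join(lines)
```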

See #557 for further explanation of what is checked for overall. [example redacted]