freedomofpress / securethenews

An automated scanner and web dashboard for tracking TLS deployment across news organizations
https://securethe.news
GNU Affero General Public License v3.0

blacklight integration #273

Open redshiftzero opened 4 years ago

redshiftzero commented 4 years ago

The Markup released a tool last week that scans sites for ad trackers, third-party cookies, key logging, and session recording, among other technologies that privacy-conscious readers should know about. Links:

It would be worthwhile to integrate this tool into STN so that we can track the use of these technologies over time on major news sites. If we like this, there are a few questions to decide here:

Another question is how best to integrate the tool itself: to perform the scans we need node installed, and the container we currently use for scanning doesn't have it. After initial discussions today, installing node in the container where scans are done seemed like a reasonable/acceptable approach for this purpose, but I'm noting it here in case folks come up with other ideas.

eloquence commented 4 years ago

I would suggest starting by including the scan results in the detail view (e.g. https://securethe.news/sites/the-intercept) and the API. That gives us some time to live with the data and check for false positives/false negatives without immediately modifying the scores.

At least from the web UI it looks like Blacklight identifies some particularly egregious practices:

These may be good candidates for surfacing on the leaderboard in the near term. https://themarkup.org/blacklight?url=theintercept.com seems to employ its own scoring under the hood ("more than the average" number of trackers etc.) -- perhaps we could collaborate with them on a privacy score?

conorsch commented 4 years ago

> I would suggest starting by including the scan results in the detail view [...] without immediately modifying the scores.

Agreed, that sounds like a modest investment, and allows us to add the integration with minimal commitment.

> Another question is how best to integrate the tool itself

Regarding the architecture, I'll summarize out-of-band conversations with a few folks. The STN scanning code to date is all Python, and the Blacklight code is JS. We could try templating out JS files with the domain name hardcoded, evaluating them, and then reading in the file that was written to disk. It might be cleanest instead to bolt a simple HTTP GET service onto the existing JS functionality, then poll that endpoint from the Python app code. That'd allow us to keep the Python & Node containers completely separate, and the Node container wouldn't need to be publicly accessible; it'd only be available to the app for local requests and responses.

There's still a bit of JS code to write to make that work, but having the Blacklight scanning logic separate from the bulk of the Wagtail code sounds worth the effort. Might be worth pinging the Markup folks if we have trouble cobbling together that solution.
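
For illustration, the Python side of the HTTP GET approach might look roughly like the sketch below. Everything specific in it is an assumption, not something from the STN or Blacklight code: the service hostname and port, the `/scan` path, the `url` query parameter, and the idea that the service replies with a JSON object.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical address of the internal Node scan service (assumption, not
# taken from the STN codebase). It would only be reachable from the app.
SCANNER_URL = "http://blacklight-scanner:8080/scan"


def build_scan_url(domain, base_url=SCANNER_URL):
    """Return the GET URL asking the scan service to inspect `domain`."""
    return f"{base_url}?{urlencode({'url': domain})}"


def fetch_scan(domain, base_url=SCANNER_URL, opener=urlopen, timeout=120):
    """Request a scan over HTTP and return the decoded JSON body as a dict.

    `opener` is injectable so the function can be exercised without a
    running scan service.
    """
    with opener(build_scan_url(domain, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The long default timeout is a guess that a browser-driven scan may take a while; a real integration would probably want retries and error handling around the request as well.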

eloquence commented 4 years ago

@redshiftzero Checking in, are you planning to work on this in the near future / already working on it? If so, will add to the web board for visibility.

redshiftzero commented 4 years ago

I'm not actively working on this right now, but I'll assign myself if/when I do (looks like I have permissions to do that now)

conorsch commented 3 years ago

Had a chat with Surya & Simone at the Markup recently, and they magnanimously offered to let us poll their API directly for inclusion in STN, rather than bottle up the Blacklight scanning code and re-run scans for each website ourselves. That's certainly a far sight simpler than pulling in the code ourselves! Looking forward to stubbing out some endpoints locally, although the question of how to present findings on the site still leaves us a lot of options in terms of design.
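
As a rough idea of what stubbing out an endpoint locally could look like, here is a small stand-in server for development. The path, the `url` query parameter, and the payload fields are all made up for the sketch; the shape of the Markup's actual API responses isn't described in this thread.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Made-up payload for local development only; field names are hypothetical.
CANNED_RESULT = {"ad_trackers": 0, "third_party_cookies": 0}


def stub_response(path):
    """Return (status, payload) for a request path like '/scan?url=example.com'."""
    domain = parse_qs(urlparse(path).query).get("url", [""])[0]
    if not domain:
        return 400, {"error": "missing url parameter"}
    return 200, {"domain": domain, **CANNED_RESULT}


class StubScanHandler(BaseHTTPRequestHandler):
    """Serve the canned result as JSON for every GET request."""

    def do_GET(self):
        status, payload = stub_response(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_stub_server(port=0):
    """Bind the stub to localhost; port 0 lets the OS choose a free port."""
    return HTTPServer(("127.0.0.1", port), StubScanHandler)


if __name__ == "__main__":
    server = make_stub_server(8080)
    print(f"Stub listening on port {server.server_address[1]}")
    server.serve_forever()
```

Keeping the response logic in a plain function means the app code that consumes the results can be tested against it without opening a socket.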