freedomofpress / securethenews

An automated scanner and web dashboard for tracking TLS deployment across news organizations
https://securethe.news
GNU Affero General Public License v3.0

Region and/or topic-based leaderboards #91

Closed garrettr closed 5 years ago

garrettr commented 7 years ago

We've been getting a lot of requests to add specific sites from other countries to the Secure the News leaderboard. As the leaderboard grows, it gets harder and harder for humans to parse visually (although the search feature is still very handy). It would be nice to be able to filter the leaderboard by other criteria, such as country of origin or topic (e.g. IT news, finance, etc.).

At a minimum, we need to consider:

eloquence commented 7 years ago

Question: Do you anticipate operational capacity limits on the total number of sites securethe.news can support scanning within the current 8 hour intervals? Can we go from ~100 to 1,000 to 10,000 sites if needed? This may influence the design choices / inclusion criteria.

Suggestion: It might be best to strive for a tagging based system (where a tag could be a country-tag like "germany" or a topic tag like "finance" or an organization tag like "nonprofit"), and to put the top sites listed on the frontpage in a "top" tag ("top" could be decided based on estimated reach derived from a single, easy to query source like Quantcast traffic data).

Using an additional selector box, users would then be able to remove or add from that list with a tag search UI such as https://sean.is/poppin/tags or https://selectize.github.io/selectize.js/, which would be pre-populated with the "top" tag in the search box.
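
To make the tag idea a bit more concrete, here is a minimal sketch of the filtering that a selector box like this could drive. Everything in it is an assumption for illustration: the `Site` record, the `filter_by_tags` helper, the example domains, and the choice that a site must carry all selected tags (AND semantics) are hypothetical and not taken from the securethenews codebase.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Site:
    # Hypothetical record; the real project's model may look different.
    name: str
    domain: str
    tags: frozenset = field(default_factory=frozenset)


def filter_by_tags(sites, selected):
    """Return sites carrying every selected tag (an empty selection matches all)."""
    wanted = frozenset(tag.lower() for tag in selected)
    return [site for site in sites if wanted <= site.tags]


# Made-up example data, for illustration only.
SITES = [
    Site("Example Daily", "daily.example", frozenset({"top", "germany"})),
    Site("Example Finance", "finance.example", frozenset({"finance", "germany"})),
    Site("Example Nonprofit News", "npo.example", frozenset({"top", "nonprofit"})),
]

# The selector starts pre-populated with "top"; users add or remove tags from there.
default_view = filter_by_tags(SITES, ["top"])
german_top = filter_by_tags(SITES, ["top", "germany"])
```

Whether several selected tags should narrow the list (as above) or widen it is a UX decision that this sketch does not settle.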

We could just update these from the CSV source once we have #110 resolved. If we maintain the input data as a single CSV, even a 10x to 100x increase wouldn't be unmanageable -- if the demo file size is an indication, even a lot of additional content would only get us into the 1MB order of magnitude. That still seems manageable via on-GitHub versioning/PRs/edits.

One major downside of tags is that they're a bit harder to internationalize. If full i18n/l10n is important, it may be useful to limit the set of permitted categories, rather than just allowing free-text in the CSV. But full i18n/l10n would be a pretty big project in and of itself.
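
As a rough sketch of what "limiting the set of permitted categories" could mean for a CSV import, the snippet below rejects any tag that is not in a fixed list. The column names (`name`, `domain`, `tags`), the `;` separator, the `ALLOWED_TAGS` set, and the `load_sites` helper are all assumptions made up for this example; the actual CSV layout is not specified in this thread.

```python
import csv
import io

# Hypothetical closed vocabulary; a fixed list is easier to translate than free text.
ALLOWED_TAGS = {"top", "germany", "finance", "nonprofit"}


def load_sites(csv_text):
    """Parse rows of name,domain,tags where tags are separated by ';'."""
    sites = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        tags = {t.strip().lower() for t in row["tags"].split(";") if t.strip()}
        unknown = tags - ALLOWED_TAGS
        if unknown:
            raise ValueError(f"row {row_number}: unknown tags {sorted(unknown)}")
        sites.append({"name": row["name"], "domain": row["domain"], "tags": tags})
    return sites


# Made-up sample input, for illustration only.
SAMPLE = """name,domain,tags
Example Daily,daily.example,top;germany
Example Finance,finance.example,finance
"""

sites = load_sites(SAMPLE)
```

A check like this could run on pull requests that edit the CSV, so that a typo in a tag is caught before it reaches the site.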

garrettr commented 7 years ago

> Do you anticipate operational capacity limits on the total number of sites securethe.news can support scanning within the current 8 hour intervals? Can we go from ~100 to 1,000 to 10,000 sites if needed? This may influence the design choices / inclusion criteria.

For the purposes of this issue, you should behave as if there is no limit on the scanning capacity of the site. This issue is focused on the UX challenges associated with adding more and more data to the site.

Brain dump for future reference: at a certain point, if we added enough sites, I can see there being two primary areas that would probably need improvement:

  1. The current scanning code is synchronous (one site is scanned at a time). As more sites are added, the total time to scan them all will increase, but that can be easily remedied by making the scanning code asynchronous so it can scan multiple sites at once.
  2. All of the site data is currently included in a single JSON blob (STNsiteData) which is embedded in the templates. If this blob gets very large it could have a noticeable negative impact on page load and parse times. However, I think we have quite a bit of headroom and response compression already helps a lot.
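
For the first point, one possible shape of "scan multiple sites at once" is sketched below. It uses a thread pool from the standard library rather than `asyncio`, simply because that is the smallest change for I/O-bound work; it is not the project's scanner. The `scan_all` helper and the `fake_scan` stand-in are hypothetical names, and the real scanning function would do network I/O where `fake_scan` returns a canned result.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_all(domains, scan_site, max_workers=8):
    """Run scan_site over all domains concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(domains, pool.map(scan_site, domains)))


def fake_scan(domain):
    # Stand-in for the real scanner, which would do network I/O here.
    return {"domain": domain, "https": domain.endswith(".example")}


results = scan_all(["daily.example", "finance.example"], fake_scan)
```

With this shape, total scan time is bounded roughly by the slowest sites and the worker count instead of by the sum of all individual scans.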

garrettr commented 7 years ago

> Suggestion: It might be best to strive for a tagging based system (where a tag could be a country-tag like "germany" or a topic tag like "finance" or an organization tag like "nonprofit"), and to put the top sites listed on the frontpage in a "top" tag ("top" could be decided based on estimated reach derived from a single, easy to query source like Quantcast traffic data).

I think this is a reasonable suggestion and would certainly consider merging something like this.

My only reservation is that I think it's generally better, where possible, to associate data with specific categories (e.g. "country", "topic") instead of generic tags. This gives the relations clear and unambiguous meaning, makes defining requirements (e.g. a Site must have an associated Country) easier, makes validation easier, etc. The only downside is that you have to devise a schema beforehand, whereas a generic tagging solution sort of allows you to improvise the categorization of your data over time.
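
To illustrate the difference, here is a small sketch of the category approach using plain dataclasses instead of Django models so that it runs on its own. The field names, the `COUNTRIES` and `TOPICS` vocabularies, and the validation rules are assumptions for the example, not an actual schema from this project.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies; the real lists would be decided by the project.
COUNTRIES = {"DE", "US", "FR"}
TOPICS = {"finance", "it", "general"}


@dataclass(frozen=True)
class Site:
    name: str
    domain: str
    country: str                 # required: every Site must have a country
    topic: Optional[str] = None  # optional category

    def __post_init__(self):
        # Each category is validated against its own vocabulary.
        if self.country not in COUNTRIES:
            raise ValueError(f"unknown country: {self.country!r}")
        if self.topic is not None and self.topic not in TOPICS:
            raise ValueError(f"unknown topic: {self.topic!r}")


site = Site("Example Finance", "finance.example", country="DE", topic="finance")
```

Here a missing country is an error at construction time, which is the kind of requirement that is awkward to express with free-form tags.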

That said, I'd be fine with either approach.

brierjon commented 6 years ago

Leveraging WikiData as proposed in #173 may help both with scaling the coverage and with providing and maintaining generalizable metadata (country, subscription type, size, etc.).