Region and/or topic-based leaderboards

garrettr commented 7 years ago

We've been getting a lot of requests to add specific sites from other countries to the Secure the News leaderboard. As the leaderboard grows, it gets harder and harder for humans to parse visually (although the search feature is still very handy). It would be nice to be able to filter the leaderboard by other criteria, such as country of origin or topic (e.g. IT news, finance, etc.).

At a minimum, we need to consider:

- Adding additional filtering criteria to the Site model (easy).
- Devising a process to collect this data and maintain it over time (medium).
- Designing a user-friendly way to filter the leaderboard based on the new criteria (medium). There are at least a few different options for how to do this:
  - Keep a single leaderboard, and add optional filter controls to the frontend.
  - Create separate leaderboard pages for specific countries, topics, etc.
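
To make the first and third points concrete, here is a minimal sketch of what filtering on extra criteria could look like. It is plain Python rather than the project's actual model code; the `Site` class, its `country` and `topic` fields, the `filter_sites` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Site:
    # Hypothetical stand-in for the real Site model; field names are assumptions.
    name: str
    domain: str
    country: Optional[str] = None
    topic: Optional[str] = None


def filter_sites(
    sites: Iterable[Site],
    country: Optional[str] = None,
    topic: Optional[str] = None,
) -> List[Site]:
    """Return the sites that match every criterion that was supplied."""
    result = []
    for site in sites:
        if country is not None and site.country != country:
            continue
        if topic is not None and site.topic != topic:
            continue
        result.append(site)
    return result


# Made-up sample data for illustration only.
SITES = [
    Site("Example Daily", "daily.example", country="US", topic="general"),
    Site("Beispiel Finanz", "finanz.example", country="DE", topic="finance"),
    Site("Example Tech", "tech.example", country="US", topic="it"),
]

us_sites = filter_sites(SITES, country="US")
```

The same function could back either UI option: a single leaderboard would call it with whatever the filter controls select, while separate pages would call it with a fixed country or topic.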

eloquence commented 7 years ago

Question: Do you anticipate operational capacity limits on the total number of sites securethe.news can support scanning within the current 8 hour intervals? Can we go from ~100 to 1,000 to 10,000 sites if needed? This may influence the design choices / inclusion criteria.

Suggestion: It might be best to strive for a tagging based system (where a tag could be a country-tag like "germany" or a topic tag like "finance" or an organization tag like "nonprofit"), and to put the top sites listed on the frontpage in a "top" tag ("top" could be decided based on estimated reach derived from a single, easy to query source like Quantcast traffic data).

Using an additional selector box, users would then be able to remove or add from that list with a tag search UI such as https://sean.is/poppin/tags or https://selectize.github.io/selectize.js/, which would be pre-populated with the "top" tag in the search box.
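
A rough sketch of the selection logic behind such a tag box follows. It assumes that selecting several tags shows the union of their sites (so adding a tag adds sites to the list); the thread does not specify this, and the names `DEFAULT_TAGS`, `SITE_TAGS`, and `sites_matching` plus the sample data are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List

# The search box starts out pre-populated with the "top" tag.
DEFAULT_TAGS: FrozenSet[str] = frozenset({"top"})

# Made-up sample data: domain -> tags.
SITE_TAGS: Dict[str, FrozenSet[str]] = {
    "daily.example": frozenset({"top", "usa"}),
    "finanz.example": frozenset({"germany", "finance"}),
    "spenden.example": frozenset({"germany", "nonprofit"}),
}


def sites_matching(
    site_tags: Dict[str, FrozenSet[str]],
    selected: Iterable[str] = DEFAULT_TAGS,
) -> List[str]:
    """Return domains that carry at least one selected tag, sorted by name."""
    chosen = set(selected)
    return sorted(domain for domain, tags in site_tags.items() if tags & chosen)


default_view = sites_matching(SITE_TAGS)
with_germany = sites_matching(SITE_TAGS, {"top", "germany"})
```

If intersection semantics were preferred instead (only sites carrying all selected tags), the condition would become `chosen <= tags`.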

We could just update these from the CSV source once we have #110 resolved. If we maintain the input data as a single CSV, even a 10x to 100x increase wouldn't be unmanageable -- if the demo file size is an indication, even a lot of additional content would only get us into the 1MB order of magnitude. That still seems manageable via on-GitHub versioning/PRs/edits.

One major downside of tags is that they're a bit harder to internationalize. If full i18n/l10n is important, it may be useful to limit the set of permitted categories, rather than just allowing free-text in the CSV. But full i18n/l10n would be a pretty big project in and of itself.
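
As an illustration of limiting the permitted categories, the sketch below loads tags from CSV text and rejects anything outside a fixed set. The column names (`domain`, `tags`), the `;` separator, the `ALLOWED_TAGS` set, and the `load_site_tags` helper are assumptions made for this example, not the format of the actual input file.

```python
import csv
import io
from typing import Dict, List, Set

# Hypothetical closed vocabulary; a fixed set like this is easier to translate.
ALLOWED_TAGS: Set[str] = {"top", "germany", "finance", "nonprofit"}


def load_site_tags(csv_text: str, allowed: Set[str] = ALLOWED_TAGS) -> Dict[str, List[str]]:
    """Parse `domain,tags` rows and raise ValueError on any tag outside `allowed`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    result: Dict[str, List[str]] = {}
    for row in reader:
        # Tags within one cell are separated by ';' (an assumption of this sketch).
        tags = [t.strip().lower() for t in row["tags"].split(";") if t.strip()]
        unknown = sorted(set(tags) - allowed)
        if unknown:
            raise ValueError(f"{row['domain']}: unknown tags {unknown}")
        result[row["domain"]] = tags
    return result


SAMPLE = "domain,tags\nfinanz.example,germany;finance\ndaily.example,top\n"
site_tags = load_site_tags(SAMPLE)
```

A check like this could run against each proposed CSV change, so that a typo such as `germny` fails loudly instead of silently creating a new tag.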

garrettr commented 7 years ago

> Do you anticipate operational capacity limits on the total number of sites securethe.news can support scanning within the current 8 hour intervals? Can we go from ~100 to 1,000 to 10,000 sites if needed? This may influence the design choices / inclusion criteria.

For the purposes of this issue, you should behave as if there is no limit on the scanning capacity of the site. This issue is focused on the UX challenges associated with adding more and more data to the site.

Brain dump for future reference: at a certain point, if we added enough sites, I can see there being two primary areas that would probably need improvement:

- The current scanning code is synchronous (one site is scanned at a time). As more sites are added, the total time to scan them all will increase, but that can be easily remedied by making the scanning code asynchronous so it can scan multiple sites at once.
- All of the site data is currently included in a single JSON blob (STNsiteData) which is embedded in the templates. If this blob gets very large it could have a noticeable negative impact on page load and parse times. However, I think we have quite a bit of headroom and response compression already helps a lot.
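
For the first point, scanning several sites at once could be sketched with `asyncio` as below. This is not the project's scanner: `scan_site` is a placeholder that only sleeps to simulate network I/O, and `scan_all` and `max_concurrent` are hypothetical names.

```python
import asyncio
from typing import Dict, List


async def scan_site(domain: str) -> Dict[str, object]:
    """Placeholder scan; a real scanner would make network requests here."""
    await asyncio.sleep(0.01)  # simulate waiting on I/O
    return {"domain": domain, "scanned": True}


async def scan_all(domains: List[str], max_concurrent: int = 10) -> List[Dict[str, object]]:
    """Scan all domains concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(domain: str) -> Dict[str, object]:
        async with semaphore:
            return await scan_site(domain)

    # gather() returns results in the same order as the input domains.
    return await asyncio.gather(*(bounded(d) for d in domains))


results = asyncio.run(scan_all(["a.example", "b.example", "c.example"], max_concurrent=2))
```

The semaphore caps how many scans run at the same time, which keeps a large site list from opening thousands of connections at once.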

garrettr commented 7 years ago

> Suggestion: It might be best to strive for a tagging based system (where a tag could be a country-tag like "germany" or a topic tag like "finance" or an organization tag like "nonprofit"), and to put the top sites listed on the frontpage in a "top" tag ("top" could be decided based on estimated reach derived from a single, easy to query source like Quantcast traffic data).

I think this is a reasonable suggestion and would certainly consider merging something like this.

My only reservation is that I think it's generally better, where possible, to associate data with specific categories (e.g. "country", "topic") instead of generic tags. This gives the relations clear and unambiguous meaning, makes defining requirements (e.g. a Site must have an associated Country) easier, makes validation easier, etc. The only downside is that you have to devise a schema beforehand, whereas a generic tagging solution sort of allows you to improvise the categorization of your data over time.

That said, I'd be fine with either approach.
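
The trade-off between specific categories and generic tags can be illustrated with a small sketch. Everything here is hypothetical (`Country`, `Topic`, `CategorizedSite`, `TaggedSite`, and their values); it only contrasts the two shapes of data and is not a proposed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Country(Enum):
    GERMANY = "DE"
    UNITED_STATES = "US"


class Topic(Enum):
    FINANCE = "finance"
    IT = "it"


@dataclass(frozen=True)
class CategorizedSite:
    """Category approach: the schema is fixed up front and validated."""
    domain: str
    country: Country  # required: a site must have an associated country
    topic: Optional[Topic] = None

    def __post_init__(self) -> None:
        if not isinstance(self.country, Country):
            raise TypeError(f"country must be a Country, got {self.country!r}")
        if self.topic is not None and not isinstance(self.topic, Topic):
            raise TypeError(f"topic must be a Topic, got {self.topic!r}")


@dataclass(frozen=True)
class TaggedSite:
    """Tag approach: any string is accepted, so categories can evolve freely."""
    domain: str
    tags: FrozenSet[str] = frozenset()


ok = CategorizedSite("finanz.example", Country("DE"), Topic.FINANCE)
loose = TaggedSite("finanz.example", frozenset({"germny", "finance"}))  # typo goes unnoticed
```

With categories, a misspelled or missing country fails at construction time; with free-form tags, the misspelled `germny` is stored without complaint, which is the flexibility and the risk in one example.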

brierjon commented 6 years ago

Leveraging WikiData as proposed in #173 may help both with scaling the coverage and with providing and maintaining generalizable metadata (country, subscription type, size, etc.).

freedomofpress / securethenews

Region and/or topic-based leaderboards #91