freedomofpress / securethenews

An automated scanner and web dashboard for tracking TLS deployment across news organizations
https://securethe.news
GNU Affero General Public License v3.0
102 stars 25 forks

Optimize database performance #84

Open garrettr opened 7 years ago

garrettr commented 7 years ago

Secure the News was very much designed as an "MVP," following the mantra that "premature optimization is the root of all evil." As a result, querying Scans in the database is exceptionally slow. To see this slowness in action, load the site from the development environment and profile the homepage or the leaderboard (both trigger expensive database queries).

The only reason this slowness does not affect the production site is our extensive use of front-end caching (thanks, Cloudflare!).

There are some other issues with our database design as well. For example, we re-scan and store the full set of scan data for each site every 8 hours, even though the results change infrequently. It would be much more efficient to store only the deltas of each scan, or to store a full set of scan data only when the results change.

Also, since most of the sites app's data is serialized into a relational database from the JSON output of our scanning tool, pshtt, I've wondered in the past whether a database optimized for storing and querying JSON directly might be a better fit. During the initial development phase I wanted to avoid the complexity of using anything other than the default Django ORM, so this idea was never explored in depth.

Finally: I am an amateur when it comes to database design, so I welcome any and all ideas here!

redshiftzero commented 7 years ago

Here's one approach that would alleviate some of the sluggishness from storing the full scan results every time a site is scanned:

  1. In the Scan model, add a column last_scanned
  2. When a site is scanned, first take the most recent scan for that site and see if the scan result has changed:
    • If the scan result has changed, insert a new row with last_scanned set to the timestamp of the scan.
    • If the scan result has not changed, do not insert a new row. Instead, update last_scanned to set it to the timestamp of the most recent scan.

This way you are tracking changes instead of recording every scan.
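The steps above could be sketched roughly like this (plain Python standing in for the ORM; the row layout and the last_scanned/result field names are assumptions, matching the proposal rather than the current model):

```python
from datetime import datetime, timezone

# In-memory stand-in for the Scan table: per-site lists of rows, newest last.
scans = {}

def record_scan(site, result, timestamp=None):
    """Insert a new row only when the scan result changed;
    otherwise just bump last_scanned on the most recent row."""
    timestamp = timestamp or datetime.now(timezone.utc)
    rows = scans.setdefault(site, [])
    if rows and rows[-1]["result"] == result:
        # Unchanged result: update the existing row's timestamp in place.
        rows[-1]["last_scanned"] = timestamp
    else:
        # Changed (or first) result: record a new delta row.
        rows.append({"result": result, "last_scanned": timestamp})

record_scan("example.com", {"hsts": False})
record_scan("example.com", {"hsts": False})  # no new row, timestamp updated
record_scan("example.com", {"hsts": True})   # result changed: new row
print(len(scans["example.com"]))  # 2
```

In the real ORM version, the "most recent scan" lookup would presumably be a single ordered query on last_scanned per site, so each scan costs at most one read plus one write.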

redshiftzero commented 7 years ago

FYI, I implemented this for the SecureDrop landing page scanner so I could scan all the landing pages once an hour. If at some future point the STN database gets too slow, or we decide we want to scan more often, we could use the same code here and do a migration.