Police-Data-Accessibility-Project / data-sources-app

An API and UI for using and maintaining the Data Sources database
MIT License

Security Bug: Sensitive Data in Stage Database Can Be Exposed through GitHub Actions #378

Open maxachis opened 2 months ago

maxachis commented 2 months ago

While I took pains to avoid exposing sensitive data in the Sandbox database, there are still ways to expose sensitive data in the Stage database.

The hole is through GitHub Actions.

As designed, the Stage environment is meant to be interfaced with through GitHub Actions, which run tests against the database. If someone wanted to expose sensitive data, they could write tests that print the contents of database rows, run them in GitHub Actions, and thereby see that data printed in the GitHub Actions log.
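To make the attack concrete, here is a minimal sketch of what such a test could look like, assuming a Postgres database reached through a `STAGE_DATABASE_URL` environment variable and a `users` table with `email` and `api_key` columns (all hypothetical names):

```python
# Hypothetical illustration only: the connection string, table, and column
# names are assumptions, not the real schema.
import os

import psycopg2  # assumed driver; the actual test suite may differ


def test_dump_sensitive_rows():
    """A 'test' like this would pass CI while dumping rows into the public log."""
    conn = psycopg2.connect(os.environ["STAGE_DATABASE_URL"])
    with conn.cursor() as cursor:
        cursor.execute("SELECT email, api_key FROM users LIMIT 100;")
        for row in cursor.fetchall():
            print(row)  # stdout from tests ends up in the GitHub Actions log
    conn.close()
```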

Note that passwords, being encrypted, would still have some protection if this were exploited. API keys, user emails, and information on requests, however, would not. Additionally, a person could decrypt the passwords (using functions we already have in our code) and then print those to the log.

There are a few possible ways to address this:

  1. Simply remove the sensitive data in the Stage database and replace it with fake data, as I do with Sandbox. This is the easiest to do, but it does mean the Stage database resembles production less in terms of the amount of data.
  2. Scramble or otherwise anonymize the user data in Stage. This is a more involved process that requires targeting the columns identified as sensitive (see the sketch after this list).
  3. Come up with some way to protect database-interfacing tests, so that even when they work on personal data, they can't expose it. This might be as simple as not logging test output. The downside is that this makes the tests harder to use for legitimate purposes.
  4. Encrypt the data in the database in some form or fashion. This is a longer-term solution, but it would probably be the best one, and most likely what we should be doing anyway.
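As a rough sketch of option 2, assuming the same hypothetical `STAGE_DATABASE_URL` connection string and a hand-maintained map of sensitive columns, the scrambling could be a one-off script like the following (`gen_random_uuid()` assumes PostgreSQL 13+ or the pgcrypto extension):

```python
# Sketch of option 2: scramble sensitive columns in place. The env var,
# table, and column names are assumptions.
import os

import psycopg2

# Hypothetical map of table -> columns identified as sensitive.
SENSITIVE_COLUMNS = {
    "users": ["email", "api_key"],
}


def scramble_stage_data():
    conn = psycopg2.connect(os.environ["STAGE_DATABASE_URL"])
    with conn.cursor() as cursor:
        for table, columns in SENSITIVE_COLUMNS.items():
            for column in columns:
                # Hash each value together with a random salt so rows stay
                # distinct but the originals are unrecoverable. Interpolation
                # is safe here because names come from the trusted constant.
                cursor.execute(
                    f"UPDATE {table} "
                    f"SET {column} = md5({column}::text || gen_random_uuid()::text);"
                )
    conn.commit()
    conn.close()
```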

For a quick and dirty solution, option 1 would work. If we're worried about verisimilitude, we can make our fake data generation more sophisticated and add more data.
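For example, a library like Faker can generate production-shaped rows cheaply; a minimal sketch, with field names assumed for illustration:

```python
# Sketch of richer fake-data generation for option 1, using the Faker library.
from faker import Faker

fake = Faker()


def generate_fake_users(n: int) -> list[dict]:
    """Produce n production-shaped but entirely synthetic user rows."""
    return [
        {
            "email": fake.unique.email(),
            "name": fake.name(),
            "api_key": fake.sha256(),  # random hex standing in for a real key
        }
        for _ in range(n)
    ]
```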

Option 4 is probably good for the long term, but it has a lot of unknowns that would need to be worked out.

josh-chamberlain commented 2 months ago

Hmm. Who has authority to write and run GitHub Actions? That limits the risk somewhat, but could still go wrong.

maxachis commented 2 months ago

> Hmm. Who has authority to write and run GitHub Actions? That limits the risk somewhat, but could still go wrong.

In this case, the issue is not with the GitHub Actions themselves, but with the tests the actions run. A person could create a test called, say, `test_print_the_unencrypted_password_of_every_user`, and GitHub Actions would run it. Even if we limited who can write and run GitHub Actions, as long as an action can run something whose level of access we aren't restricting, the hole can be exploited.

josh-chamberlain commented 2 months ago

OK, I think we should go with option 4 and encrypt it! We can use this as a reference: https://docs.pdap.io/activities/data-dictionaries/hidden-properties

maxachis commented 2 months ago

@josh-chamberlain Can do!

#388 can serve as a case study in how to encrypt necessary data. The current plan is to use an encryption key, provided as an environment variable, which trusted sources use to encrypt and decrypt the data. Then, for GitHub Actions, we merely need to give the action a different encryption key.
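A minimal sketch of that plan, assuming Fernet symmetric encryption from the `cryptography` package; the `DATA_ENCRYPTION_KEY` variable name is illustrative:

```python
# Sketch only: trusted environments hold the real key, while GitHub Actions
# is handed a different one, so decryption of real data there simply fails.
import os

from cryptography.fernet import Fernet


def _cipher() -> Fernet:
    # Expects a standard Fernet key (32 url-safe base64-encoded bytes).
    return Fernet(os.environ["DATA_ENCRYPTION_KEY"].encode())


def encrypt_value(plaintext: str) -> str:
    return _cipher().encrypt(plaintext.encode()).decode()


def decrypt_value(ciphertext: str) -> str:
    return _cipher().decrypt(ciphertext.encode()).decode()
```

Because Fernet raises `InvalidToken` when the key doesn't match, a test running under the substitute key couldn't recover the real plaintext, only an exception.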

From there, we can expand to additional hidden properties.

There are likely also ways to lock down GitHub Actions, most prominently by restricting who can make a pull request, or by moving some tests off-site. How far we want to go with that is up for debate.

Documentation on security hardening for GitHub Actions is probably a good reference to consult.