This is an application for gathering responses from confidential surveys in a way that doesn't result in a large table of sensitive records.
The basic idea is not to store individual form responses as records, but to use each survey response only to increment the appropriate counters. This lets us derive the statistics we ultimately want to measure without assembling a large database of private responses. This principle of collecting only the minimum amount of information is also known as Datensparsamkeit, which is just a cool word to say.
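As a rough illustration of the counter idea, here is a minimal sketch in Ruby. The TallyStore class, its in-memory Hash, and the field names are assumptions made for this example only; they are not taken from the application's actual models, which persist counts in the database.

```ruby
# Hypothetical in-memory tally store, used only to illustrate the idea.
class TallyStore
  def initialize
    @counts = Hash.new(0)
  end

  # Increment one counter per answered field. The response hash itself
  # is never stored, so no per-respondent record exists afterwards.
  def record(survey_id, response)
    response.each do |field, value|
      @counts[[survey_id, field, value]] += 1
    end
    nil
  end

  def count(survey_id, field, value)
    @counts[[survey_id, field, value]]
  end
end

store = TallyStore.new
store.record('ice-cream', 'flavor' => 'vanilla', 'like-ice-cream' => 'yes')
store.record('ice-cream', 'flavor' => 'chocolate', 'like-ice-cream' => 'yes')
puts store.count('ice-cream', 'like-ice-cream', 'yes') # prints 2
```

After both calls, the store can report that two people like ice cream and that one prefers vanilla, but it cannot say which respondent gave which combination of answers.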
So, if we had a survey on ice cream and we wanted to ask employees:
And so on, we could classify the types of questions here among several distinct types to start with:
combination if they pick more than one.
A survey about ice cream is admittedly a dumb example. It's something you could create with an existing public service like SurveyMonkey or Google Forms. Imagine however that we wanted to ask questions about something more confidential like employee diversity or sexual orientation. These systems all collect individual responses as database records or rows in a spreadsheet. While they are probably secure, why do I need this detailed information if I am only going to generate summary statistics anyway? Individual responses might be anonymous, but may endanger a respondent's privacy when combined together in a query. Why should I be asking people to trust me that nobody will use these records to drill down and do something awful like count how many LGBT people are in the accounting department of the NYC office? What if the data collection only allowed for pre-approved interpretations?
This program is written to preserve privacy automatically by discarding each survey submission after using it only to increment counters, like this:
Survey: ice-cream
If we also wanted to drill down on the intersections between two fields, we could specify that in a configuration in advance (this system is designed to prevent such analysis after the fact).
Be careful: This functionality is meant for very broad intersections, like engineering/non-engineering AND gender, for instance. Finer-grained intersections that span many fields and result in only a few responses could harm the privacy of individuals.
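The difference between intersections declared in advance and ad hoc queries can be sketched as follows. The INTERSECTIONS constant, the record_intersections helper, and the field names are hypothetical and only illustrate the principle; the application's real configuration format is not shown here.

```ruby
# Hypothetical configuration: only these field pairs are ever cross-tabulated.
INTERSECTIONS = [%w[department gender]].freeze

# Increment a combined counter for each pre-declared pair of fields.
# Fields outside the declared pairs (such as 'office' below) are never
# combined with anything, so that breakdown cannot be recovered later.
def record_intersections(counts, response)
  INTERSECTIONS.each do |a, b|
    next unless response.key?(a) && response.key?(b)
    counts[["#{a}/#{b}", "#{response[a]}|#{response[b]}"]] += 1
  end
  counts
end

counts = Hash.new(0)
record_intersections(counts, 'department' => 'engineering', 'gender' => 'female', 'office' => 'NYC')
record_intersections(counts, 'department' => 'engineering', 'gender' => 'female', 'office' => 'DC')
puts counts[['department/gender', 'engineering|female']] # prints 2
```

Because only the declared pair is counted, a later question such as "how many respondents per office and gender" has no data behind it.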
This program has the following components:
The survey application is a Ruby on Rails application running on Ruby 2.3.0. Most of its libraries are available as gems that can be installed with Bundler. It uses PostgreSQL as its database, so you will need to have that installed.
To get a local copy running:
git clone git@github.com:18F/confidential-survey.git
cd confidential-survey
bundle install
bundle exec rake db:setup
export SURVEY_ADMIN_NAME=debug
export SURVEY_ADMIN_PASSWORD=debug
bundle exec rails server
Then you can go to http://localhost:3000/surveys/sample-survey and you should see a survey you can fill out. If you visit an administrator-protected route, it should prompt you for the username and password set above.
To run the tests:
bundle exec rake
All tests are written in RSpec.
This application is deployed on the cloud.gov PaaS which runs on Cloud Foundry. The following instructions are 18F-specific, but could easily be adapted for other Cloud Foundry instances or other web hosts.
Create the app (it's ok if the deploy fails):
cf push survey
Create the database service:
cf create-service rds shared-psql survey-psql
Set environment variables with cf set-env:
cf set-env survey SURVEY_ADMIN_NAME [username]
cf set-env survey SURVEY_ADMIN_PASSWORD [password]
The application is currently secured in production with blanket HTTP Authentication, so you will need to set its username and password. These also need to be set to run the app in cf ssh, so they have to be set twice.
Set up the database:
cf-ssh
bundle exec rake db:migrate
bundle exec rake db:seed
Restage the app:
cf restage survey
To deploy future releases:
cf push survey
Surveys are implemented as YAML configuration files within the config/surveys directory of the application (a sample survey is included in the repo). Surveys do not need to be, and probably should not be, checked into the repo.
A survey's YAML file (in config/surveys) must be deployed to production. This limits the ability to create or edit surveys on the system to the lead developer or anybody else with deploy access to the specific space.
If the survey is named SURVEY_NAME.yml, the new survey form is accessible at /surveys/SURVEY_NAME.
To make a survey inactive, meaning that it no longer accepts responses, the developer has to set a field in the survey's YAML configuration to active: false and redeploy the survey.
The survey name is used to key all tallies for its responses in the system. This means that changing the survey name/URL will reset all its tallies to 0 unless you rename all the old rows to use the new ID.
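The relationship between a survey file, its URL, and its active flag can be sketched in a few lines of Ruby. The helper names survey_path and accepting_responses?, the title key, and the choice to treat a missing active field as active are assumptions for illustration, not the application's actual code.

```ruby
require 'yaml'

# Derive the form URL from the file name: SURVEY_NAME.yml -> /surveys/SURVEY_NAME
def survey_path(filename)
  "/surveys/#{File.basename(filename, '.yml')}"
end

# Assumption: a survey counts as active unless its YAML says active: false.
def accepting_responses?(config)
  config.fetch('active', true) != false
end

# Hypothetical survey configuration; only the 'active' key comes from the text above.
config = YAML.safe_load("title: Sample survey\nactive: false\n")

puts survey_path('config/surveys/sample-survey.yml') # prints /surveys/sample-survey
puts accepting_responses?(config)                     # prints false
```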
The survey application supports two different modes of securing access:
Neither of these schemes is meant to identify specific users for a survey. The goal of these tools is merely to limit access to surveys so that they can be taken only by people who are supposed to take them.
The token scheme requires the survey administrators to generate a pool of tokens for the survey. These can then be distributed to survey participants. It is best that whoever does this distribution does not retain a list of which tokens were sent to which users, since that information could potentially be used by someone with database access to identify people who have not taken the survey.
To generate tokens, an administrator can send a GET or POST request to /surveys/SURVEY-NAME/token. This generates a token linked to the survey and returns a URL that can be given to a single user for taking the survey. The endpoint can return a batch of tokens if an n= argument is appended to the request. Here is an example of calling it on a development instance running on localhost:
curl --user "${SURVEY_ADMIN_NAME}:${SURVEY_ADMIN_PASSWORD}" "http://localhost:3000/surveys/sample-survey/token?n=10"
http://localhost:3000/surveys/sample-survey?token=z9OJSmzFZcKWDpXlnt1LPA
http://localhost:3000/surveys/sample-survey?token=wE-gRGcI0ayHH3Q8qW5MtA
http://localhost:3000/surveys/sample-survey?token=Hi59JzRPbXOAN9Mu2876sg
http://localhost:3000/surveys/sample-survey?token=FU7bwF29kKqcV-27lAIfCQ
http://localhost:3000/surveys/sample-survey?token=Wm-pvsfkr20y-pGALiYjuw
http://localhost:3000/surveys/sample-survey?token=FmOml8wTKJo7mHAjf_8y8A
http://localhost:3000/surveys/sample-survey?token=xKquRdHvi0YpJ2iADxpZpw
http://localhost:3000/surveys/sample-survey?token=PHPd_SW5i-AzZaIUscl13w
http://localhost:3000/surveys/sample-survey?token=iqQPTzQ21pdEaKjROb6Ozw
http://localhost:3000/surveys/sample-survey?token=C7Zg2J_1nyFpW-dWms-gNQ
Once a user uses this URL to fill out the survey, the token is revoked and the URL will not work again. This means that the same URL should not be given to several users. The token is only used for access and does not identify a respondent in any way. There is no issue with generating many extra tokens that aren't used, and tokens can be generated at any time while a survey is active. To close access to a survey, an administrator can revoke all tokens:
curl --user "${SURVEY_ADMIN_NAME}:${SURVEY_ADMIN_PASSWORD}" http://localhost:3000/surveys/sample-survey/revoke
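The token lifecycle described above (issue, single use, bulk revocation) can be sketched with an in-memory stand-in. The TokenPool class and its method names are hypothetical and exist only for this example; the application itself stores tokens in its database.

```ruby
require 'securerandom'
require 'set'

# Hypothetical in-memory stand-in for the pool of tokens of one survey.
class TokenPool
  def initialize
    @active = Set.new
  end

  # Generate n random tokens. Nothing here links a token to a person.
  def issue(n = 1)
    Array.new(n) { SecureRandom.urlsafe_base64(16) }.each { |t| @active << t }
  end

  # Returns true the first time a valid token is presented and revokes it,
  # so the same token (and URL) cannot be used a second time.
  def redeem(token)
    !@active.delete?(token).nil?
  end

  # Close access to the survey by revoking every outstanding token.
  def revoke_all
    @active.clear
  end
end

pool = TokenPool.new
first, second = pool.issue(2)
puts pool.redeem(first)  # prints true
puts pool.redeem(first)  # prints false, the token was revoked on first use
pool.revoke_all
puts pool.redeem(second) # prints false, all tokens were revoked
```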
Tokens are generated by the SurveyToken model using Ruby's SecureRandom class, which generates random tokens using system libraries for randomness and entropy. Currently, each token is a 16-byte random number, meaning there is a 1 in 2^128 (about 3.40282367 x 10^38) chance of guessing a token. All of this assumes that the SecureRandom library has no issues that weaken random number generation.
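The arithmetic behind that estimate can be checked directly. The sketch below assumes the tokens are produced with SecureRandom.urlsafe_base64(16), which is consistent with the 22-character URL-safe tokens in the sample output above, but the exact call used by the SurveyToken model is not confirmed here.

```ruby
require 'securerandom'

# 16 random bytes, encoded as URL-safe Base64 without padding.
token = SecureRandom.urlsafe_base64(16)
puts token.length # prints 22

# 16 bytes = 128 bits, so there are 2**128 equally likely tokens.
possible = 2**(16 * 8)
puts possible                 # prints 340282366920938463463374607431768211456
puts format('%.4e', possible) # prints 3.4028e+38
```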
Alternatively, you can specify that the tool should use blanket HTTP authentication to protect the survey form. This requires you to add 2-3 fields to the survey YAML to indicate that you want to use HTTP authentication:
access:
  type: http_auth
  user: <username>
  password: <password>
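As a rough sketch of what such a configuration implies, the following Ruby checks an HTTP Basic Authorization header against the access block. The basic_auth_ok? and constant_time_equal? helpers and the example credentials are assumptions for illustration; in the real application this check would normally be handled by Rails rather than by hand-written code.

```ruby
require 'digest'
require 'yaml'

# Hypothetical survey configuration using the fields shown above.
config = YAML.safe_load(<<~SURVEY)
  access:
    type: http_auth
    user: surveyor
    password: s3cret
SURVEY

# Compare two strings without stopping at the first differing byte.
def constant_time_equal?(a, b)
  da = Digest::SHA256.digest(a).bytes
  db = Digest::SHA256.digest(b).bytes
  da.zip(db).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

# Check the value of an "Authorization: Basic ..." header against the
# configured user and password.
def basic_auth_ok?(access, header)
  scheme, encoded = header.to_s.split(' ', 2)
  return false unless scheme == 'Basic' && encoded
  # 'm' is Ruby's pack/unpack directive for Base64.
  user, password = encoded.unpack('m').first.split(':', 2)
  constant_time_equal?(user.to_s, access['user'].to_s) &
    constant_time_equal?(password.to_s, access['password'].to_s)
end

header = 'Basic ' + ['surveyor:s3cret'].pack('m0')
puts basic_auth_ok?(config['access'], header)                # prints true
puts basic_auth_ok?(config['access'], 'Basic ' + ['x:y'].pack('m0')) # prints false
```

Note that every respondent shares the same username and password under this scheme, so the check limits access to the form without identifying anyone.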
This will then require HTTP authentication for users to access or submit the survey. There are a few caveats to this approach:
The survey must be set to active: false and redeployed to disable HTTP auth-protected surveys, since it does not rely on access tokens.
This program is written to minimize the amount of information collected to help preserve the anonymity of respondents, but I cannot explicitly guarantee that respondents will always be anonymous. There are a few ways in which anonymity could potentially be compromised:
revoke_tokens request. For this reason, whoever is distributing the tokens should ideally not keep a list of who has which tokens at all, and should not share any information with an administrator who has access to the database.
The application will set a session cookie, which seems like something that would undermine the promises of anonymity. Unfortunately, I need that cookie for Rails' protection against Cross-Site Request Forgery (CSRF) on the form. Rails' form classes provide that protection automatically. The survey application emphatically does not use the session cookie to store or retrieve any other information, and it does not use any other cookies.
This repository uses two tools to provide a total of three types of automated security checks:
All security scans are built into the test suite. Running bundle exec rake spec will run them. To run the security scans ad hoc:
Brakeman:
bundle exec brakeman
Hakiri for Ruby/Rails versions:
bundle exec hakiri system:scan -m hakiri_manifest.json
Hakiri for Gemfile dependency versions:
bundle exec hakiri gemfile:scan
Sometimes Brakeman will report a false positive. In cases like these, the warnings will be ignored. Ignored warnings are declared in config/brakeman.ignore. This file contains a machine-readable list of all ignored warnings. Any ignored warning will contain a note explaining (or linking to an explanation of) why the warning is ignored.
This project is in the worldwide public domain. As stated in CONTRIBUTING:
This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.
All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.