fecgov / fec-cms

The content management system (CMS) for the new Federal Election Commission website.
https://www.fec.gov

Research task: Sanitize inputs in feedback tool before submission #1583

Open jenniferthibault opened 6 years ago

jenniferthibault commented 6 years ago

@jenniferthibault commented on Fri Jul 28 2017

To protect people who may not understand the feedback is public, sanitize inputs to make sure we aren't unintentionally accepting any personally identifiable information.

Such information might include:


@jenniferthibault commented on Fri Jul 28 2017

@laurenancona shared an approach she's used on a project for the City of Philadelphia:

...a sanitizer script that ran between submission & posting to whatever data store. It used a regex to X-out anything resembling PII like SSN, phone, email, etc.

We used it in the context of another OSS product called Huginn, which is kind of like your own self-hosted IFTTT. I used Huginn to receive form submissions, sanitize any PII, and then route the submissions to one of several relevant Slack channels.

Because it lives inside Huginn, it’s not committed to the public repo like most other things we did. https://github.com/huginn/huginn

I can ask someone to dump the most current version (it’s more intricate and loops back to look for SSN, etc) later, but it’s just JavaScript. Here’s the email filter:


    Agent.receive = function() {
      var self = this;
      var events = this.incomingEvents();

      // emailPattern from http://stackoverflow.com/a/1373724/123776
      var emailPattern = /[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/i;

      events.forEach(function(event) {
        event.payload['q5_tellUs'] = event.payload['q5_tellUs'].replace(emailPattern, '@.***');
        self.createEvent(event.payload);
      });
    }
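
For comparison, here is a rough Python sketch of the same "X-out anything resembling PII" idea, extended to the SSN and phone cases mentioned above. The patterns, the `PII_PATTERNS` table, and the `redact` helper are illustrative assumptions, not the rules from the Philadelphia script; they will miss some formats and over-match others.

```python
import re

# Illustrative patterns only (assumptions, not the Philadelphia script's rules).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace every match of each PII pattern with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub("[{0} removed]".format(label), text)
    return text


if __name__ == "__main__":
    print(redact("Call (202) 867-5309 or write to someone@example.com"))
```

Unlike the JavaScript filter above, which has no `g` flag and so replaces only the first match, `pattern.sub` replaces every occurrence in the text.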



---

@xtine commented on [Fri Oct 20 2017](https://github.com/18F/fec-style/issues/751#issuecomment-338309750)

👋 @lbeaufort: instead of sanitizing on the front-end, it might actually be better to sanitize on the back-end, because we have to parse the data in Python before sending it to become a GitHub issue: https://github.com/18F/fec-cms/blob/develop/fec/data/views.py#L445

---

@lbeaufort commented on [Thu Oct 26 2017](https://github.com/18F/fec-style/issues/751#issuecomment-339835751)

Possibly use python library scrubadub? http://scrubadub.readthedocs.io/en/stable/index.html 

---

@lbeaufort commented on [Thu Oct 26 2017](https://github.com/18F/fec-style/issues/751#issuecomment-339837402)

@xtine would we want to remove all names or just last names? Sometimes names are helpful such as "feedback from Laura in RAD" etc.

Sample script w/scrubadub:

    import scrubadub

    text = "Jenny's SSN is 999-24-1232 and can be contacted at jennyjenny@fec.gov or (202) 867-5309."

    scrubbed = scrubadub.clean(text)

    print("text: {0}".format(text))
    print("scrubbed: {0}".format(scrubbed))

---

@lbeaufort commented on [Tue Oct 31 2017](https://github.com/18F/fec-style/issues/751#issuecomment-340877079)

@xtine:

scrubadub currently supports removing:

- Names (proper nouns) via textblob
- Email addresses
- URLs
- Phone numbers via phonenumbers
- Username / password combinations
- Skype usernames
- Social security numbers

These [advanced techniques](http://scrubadub.readthedocs.io/en/stable/advanced_usage.html#advanced-usage) allow users to fine-tune the manner in which scrubadub cleans dirty dirty text.

xtine commented 6 years ago

Related WIP PR: https://github.com/18F/fec-cms/pull/1427

lbeaufort commented 5 years ago

Recent use case: a user posted their email in a feedback box issue. We should log the original and publish a sanitized version.
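
A minimal sketch of that "log the original, publish the sanitized version" flow, assuming a plain-regex email filter and Python's standard `logging` module. The logger name `feedback`, the `EMAIL` pattern, and the `prepare_feedback` helper are hypothetical, and the log destination would need to be private, since it holds the unredacted text.

```python
import logging
import re

logger = logging.getLogger("feedback")  # hypothetical logger name

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def prepare_feedback(original):
    """Return the text to publish; log the original if anything was redacted."""
    sanitized = EMAIL.sub("[email removed]", original)
    if sanitized != original:
        # The original goes only to a private log, never to the public issue.
        logger.info("Feedback redacted before publishing; original: %r", original)
    return sanitized
```

Only the return value would be passed on to whatever creates the public issue; feedback that contains no match is returned unchanged and is not logged.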