TF-PII Redaction regexes

loubnabnl commented 1 year ago

Test and update regexes for detecting the following entities in The Stack

[ ] Emails
[ ] IP addresses
[ ] SSH & API keys

liyongsea commented 1 year ago

Hi, here is an update on the SSH key detection part:

The current SSH key regex from BigScience: https://regex101.com/r/uPld7m/1. It works fine for keys generated by ssh-keygen, but it also gives too many false positives, such as file paths and UUIDs.
git-secrets. It is also a regex-based approach. Here is one of the regexes it uses: https://regex101.com/r/GMVYDF/1. It only detects AWS keys. There is a 'naive' alternative here as well: https://regex101.com/r/MD3IwT/1
https://github.com/Yelp/detect-secrets seems to be an aggregation of regex approaches. We might want to test it as our pre-annotation method. Its performance is reported in https://aoa0.github.io/pubs/icse22.pdf, Table 3 (regex approach for detect-secrets).
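One way to cut the file path and UUID false positives mentioned above is a post-filter on each candidate match. The following is a minimal sketch, not the BigScience regex: `UUID`, `PATH_LIKE`, and `is_probable_false_positive` are hypothetical names, and the path heuristic (a candidate starting with `/`, `./`, `../`, or `~/`) is an assumption that may also reject a base64 key that happens to start with a slash.

```python
import re

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

# Assumed heuristic: rooted or relative paths made of typical filename characters.
PATH_LIKE = re.compile(r"(?:~|\.{1,2})?/(?:[\w.-]+/)*[\w.-]*")


def is_probable_false_positive(candidate: str) -> bool:
    """Return True when the whole candidate looks like a UUID or a file path."""
    return bool(UUID.fullmatch(candidate) or PATH_LIKE.fullmatch(candidate))
```

Candidates for which this returns True would be dropped before annotation; anything else is kept for the key regex to decide.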

Todo:

test detect-secrets: run it on a sample dataset and provide a qualitative analysis
reach out to Runhan Feng et al. to see if they are willing to help

liyongsea commented 1 year ago

Here is a WIP PR if you want to play with the detect-secrets library: https://github.com/bigcode-project/bigcode-analysis/pull/24/files

terryyz commented 1 year ago

Based on a question raised on StackOverflow and the description on Wikipedia, here is a summary of edge cases which the current email regex method

can't detect:

printable characters =?
quoted local-part, such as "aaaa"@example.com
space and special characters "(),:;<>@[\] are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in that quoted string, any backslash or double-quote must be preceded once by a backslash)
comments are allowed with parentheses at either end of the local-part; e.g., john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com
the domain may be an IP address literal, surrounded by square brackets [], such as jsmith@[192.168.2.1] or jsmith@[IPv6:2001:db8::1], although this is rarely seen except in email spam.
Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and john.smith@example.com(comment) are equivalent to john.smith@example.com

wrongly detects:

dot ., provided that it is the first or last character or provided also that it appears consecutively, such as John..Doe@example.com, .JohnDoe@example.com, JohnDoe.@example.com
hyphen -, provided that it is the first or last character.

Let me know if you'd like to consider these cases.
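The "wrongly detects" cases could be handled with a check applied after the regex match. This is a minimal sketch under assumptions: `CANDIDATE_EMAIL` is a simplified pattern for illustration (not the regex used in the pipeline), the function names are hypothetical, and I read the hyphen rule as applying to domain labels, which is how hostnames restrict it.

```python
import re

# Simplified candidate pattern, for illustration only.
CANDIDATE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_plausible(email: str) -> bool:
    """Reject dot and hyphen placements that are not allowed in an address."""
    local, domain = email.rsplit("@", 1)
    # A dot may not start or end the local-part, nor appear twice in a row.
    if local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    # Assumption: a hyphen may not start or end any domain label.
    return not any(
        label.startswith("-") or label.endswith("-") for label in domain.split(".")
    )


def find_emails(text: str) -> list[str]:
    """Return regex matches that also pass the placement checks."""
    return [
        m.group() for m in CANDIDATE_EMAIL.finditer(text) if is_plausible(m.group())
    ]
```

Note that this drops the invalid match entirely rather than trimming it to a valid substring; which behaviour is better for redaction is an open choice.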

loubnabnl commented 1 year ago

@liyongsea I looked at the docs for both git-secrets and detect-secrets. I’ve made a summary of both in this document. I think they are really good starting points for key detection. TL;DR:

Git-secrets has 3 regexes to detect AWS ID/key/Account, they use fixed prefixes to reduce false positives. We can add them to our regex list.
In this paper the authors detect AWS keys too with similar regexes but don’t use prefixes. To reduce false positives, they add a heuristic filter on top: remove only instances where a match for the Client ID and a match for the Secret Key appear within 5 lines of each other. They report Precision: 100% and Recall: 97%
Detect-secrets: it’s very interesting because they have many plugins (like AWSKeyDetector, AzureDetector..) which means we can test and select the best detectors, or take the regexes from each detector
- We might want to use the library though because it seems that when possible the library verifies the detected keys by sending them to the provider
- It also has 3 types of detectors: regex-based, entropy-based, and keyword-based, but I think the last two have more false positives (maybe that's why you found many false positives in the notebook in your PR)
- It also has this Gibberish Detector that we can apply on top of detected secrets to make sure they are not word-like
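To make the proximity heuristic from the paper concrete, here is a rough sketch of pairing an AWS key ID with a secret key found within 5 lines of it. The two patterns are simplified assumptions (not the exact git-secrets or paper regexes), `paired_aws_matches` is a hypothetical name, and the test values are the example credentials from the AWS documentation.

```python
import re

# Simplified patterns for illustration; assumptions, not the published regexes.
KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")
SECRET = re.compile(r"(?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=])")


def paired_aws_matches(text, max_distance=5):
    """Return (key_id, secret) pairs whose lines are at most max_distance apart."""
    ids, secrets = [], []
    for line_no, line in enumerate(text.splitlines()):
        ids += [(line_no, m.group()) for m in KEY_ID.finditer(line)]
        secrets += [(line_no, m.group()) for m in SECRET.finditer(line)]
    return [
        (key_id, secret)
        for id_line, key_id in ids
        for secret_line, secret in secrets
        if abs(id_line - secret_line) <= max_distance
    ]
```

A lone 40-character string with no key ID nearby is ignored, which is where the precision gain would come from; the fixed prefixes in `KEY_ID` play the same role as in git-secrets.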

loubnabnl commented 1 year ago

@terryyz Would be happy to consider these cases if there’s a way to cover them without decreasing the accuracy of the regex. Although I think most of them are unlikely to be present in the dataset, e.g. I didn’t know many of them were allowed in emails, and I'm not sure most platforms accept them.

EDIT: The email regex used for BigScience was updated (https://regex101.com/r/uRlGkP/1). We can test it against the current one; it seems more robust.

loubnabnl commented 1 year ago

@liyongsea I made another notebook for the analysis of detect-secrets to follow up on what you started, see this PR. I ran some tests on a dataset of 1k samples from different languages (from the annotation round) with the default plugins, and then removed some of them; there are analysis paragraphs in the notebooks. What’s left to do:

Pipeline: set up one pipeline (preferably with consistent outputs) with
- detect-secrets for keys, with the chosen plugins and filters from the notebook (or another combination if we do more analysis)
- the email and IP address detection code
Tests and fixes:
- Email detection: let’s test this new regex from BigScience, which seems more robust
- IP addresses: the current regex seems fine, but it misses the case where there is a forward slash, as in 10.0.0.0/24; let’s try to fix this
- SSH keys: we'll probably just use detect-secrets, but it would be interesting to compare against our old regex, which was updated to ignore paths, as it might have a higher recall (https://regex101.com/r/LMrTcZ/1) (we can even add an ignore-UUID regex from detect-secrets)
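For the forward-slash case, one option is to make a CIDR prefix length an optional part of the IPv4 pattern. This is a sketch, not the current regex: `OCTET`, `IPV4_WITH_OPTIONAL_CIDR`, and `find_ips` are hypothetical names, and IPv6 is not covered.

```python
import re

# One octet in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets, optionally followed by "/0" to "/32". The lookarounds stop the
# pattern from matching inside a longer dotted number such as 1.2.3.4.5.
IPV4_WITH_OPTIONAL_CIDR = re.compile(
    rf"(?<![\d.])({OCTET}(?:\.{OCTET}){{3}})(?:/(3[0-2]|[12]?\d))?(?!\d|\.\d)"
)


def find_ips(text):
    """Return (address, prefix_length or None) for each IPv4 match in text."""
    return [(m.group(1), m.group(2)) for m in IPV4_WITH_OPTIONAL_CIDR.finditer(text)]
```

Four-part version strings such as 1.2.3.4 still match, so this only addresses the slash case, not that source of false positives.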

liyongsea commented 1 year ago

@loubnabnl Thank you for the analysis! I updated my PR to integrate your suggestion. I also cleaned up a bit and removed the argument suffix (at first I thought it impacted the output...). There is one minor problem when trying to locate the secret with str.index: sometimes detect-secrets finds something like password=password, which might cause problems when redacting. I will see how to improve this.
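To illustrate the str.index problem: on the line password=password, the first occurrence of the secret is the variable name, not the value. One possible workaround, sketched below, is to prefer an occurrence that follows an assignment operator. `locate_secret`, `redact`, and the `<KEY>` placeholder are hypothetical, and the sketch assumes we already have the line text and the secret string from the detector.

```python
import re


def locate_secret(line: str, secret: str) -> tuple[int, int]:
    """Return the (start, end) span of the secret value inside line."""
    # Prefer an occurrence right after '=' or ':' (optionally quoted).
    assigned = re.search(r"[=:]\s*[\"']?(" + re.escape(secret) + r")", line)
    if assigned:
        return assigned.span(1)
    # Otherwise fall back to the first occurrence (raises ValueError if absent).
    start = line.index(secret)
    return start, start + len(secret)


def redact(line: str, secret: str, placeholder: str = "<KEY>") -> str:
    """Replace the located secret value with a placeholder."""
    start, end = locate_secret(line, secret)
    return line[:start] + placeholder + line[end:]
```

With this, redacting password=password replaces the right-hand side and leaves the variable name intact.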

loubnabnl commented 1 year ago

I updated my PR to add the PII pipeline, now only the anonymization is missing.

I also added an analysis and evaluation notebook with some observations about our regexes' behavior (the evaluation is still WIP):

detect-secrets turned out to have a very low recall, so I think we have to switch back to the regex. It still detects many paths and strings as keys, but they can be removed with the gibberish detector. There are still some things to improve though, as I explain here
email address detection also needs to be fixed, and I haven't analyzed IP addresses yet
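As a rough illustration of filtering word-like matches, here is a Shannon entropy check. This is a deliberately swapped-in technique, not the Gibberish Detector itself: `shannon_entropy` and `looks_random` are hypothetical names, and the 3.5 bits per character threshold is an assumption that would need tuning on annotated samples. Long file paths can also score high, so entropy alone would not remove them.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Entropy, in bits per character, of the character distribution of s."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


def looks_random(candidate: str, min_entropy: float = 3.5) -> bool:
    """Keep a candidate only if its characters are spread out enough."""
    # Assumed threshold; repeated or word-like strings tend to score lower.
    return shannon_entropy(candidate) >= min_entropy
```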

liyongsea commented 1 year ago

Thank you @loubnabnl! After looking at the FNs, I spotted some quick wins for detect-secrets:

annotation mistakes in samples 44, 60, and 81
repeating keys. It seems detect-secrets only provides the first occurrence:
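A possible fix for the repeated-key case is to take each secret string the detector reports and search the whole file for every occurrence. This is only a sketch under that assumption: `all_occurrences`, `redact_all`, and the `<KEY>` placeholder are hypothetical, and nothing here uses the detect-secrets API.

```python
import re


def all_occurrences(text: str, secret: str) -> list[tuple[int, int]]:
    """Return the (start, end) span of every non-overlapping occurrence."""
    if not secret:
        return []
    return [m.span() for m in re.finditer(re.escape(secret), text)]


def redact_all(text: str, secrets: list[str], placeholder: str = "<KEY>") -> str:
    """Replace every occurrence of every reported secret with a placeholder."""
    spans = sorted({span for s in secrets for span in all_occurrences(text, s)})
    out, cursor = [], 0
    for start, end in spans:
        if start < cursor:  # overlaps a region that was already redacted
            continue
        out.append(text[cursor:start])
        out.append(placeholder)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Recall on the annotated samples could then be recomputed on these expanded spans instead of the first hit only.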

loubnabnl commented 1 year ago

Interesting, I'll take a deeper look at the results

bigcode-project / bigcode-dataset

TF-PII Redaction regexes #17