Closed loubnabnl closed 1 year ago
Hi, this is an update on the SSH key detection part:
The current SSH key regex from BigScience: https://regex101.com/r/uPld7m/1. It works fine for keys generated by ssh-keygen, but it also gives too many false positives, such as file paths and UUIDs.
git-secrets. It is also a regex-based approach. Here is one of the regexes used: https://regex101.com/r/GMVYDF/1. It only detects AWS keys. There is a 'naive' alternative here as well: https://regex101.com/r/MD3IwT/1
https://github.com/Yelp/detect-secrets seems to be an aggregation of regex approaches. We might want to test it as our pre-annotation method. Here is the performance according to Table 3 of https://aoa0.github.io/pubs/icse22.pdf (regex approach for detect-secrets)
Todo:
Here is a WIP PR, if you want to play with the detect-secrets library: https://github.com/bigcode-project/bigcode-analysis/pull/24/files
Based on a question raised on StackOverflow and the description on Wikipedia, here is a summary of edge cases that the current email regex method
can't detect:
- `=?`
- `"aaaa"@example.com`
- `"(),:;<>@[\]` are allowed with restrictions (they are only allowed inside a quoted string, and in that quoted string any backslash or double quote must be preceded once by a backslash)
- `john.smith(comment)@example.com` and `(comment)john.smith@example.com` are both equivalent to `john.smith@example.com`
- `[]`, such as `jsmith@[192.168.2.1]` or `jsmith@[IPv6:2001:db8::1]`, although this is rarely seen except in email spam
- `john.smith@(comment)example.com` and `john.smith@example.com(comment)` are equivalent to `john.smith@example.com`
wrongly detects:
- `.`, provided that it is the first or last character or that it appears consecutively, such as `John..Doe@example.com`, `.JohnDoe@example.com`, `JohnDoe.@example.com`
- `-`, provided that it is the first or last character.

Let me know if you'd like to consider these cases.
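The edge cases listed above can be turned into a small regression check. This is only a sketch: `SIMPLE_EMAIL` is a hypothetical, deliberately simple pattern used for illustration, not the regex currently used in the project, so the real regex may behave differently on some of these inputs.

```python
import re

# Hypothetical simple email pattern (an assumption for illustration only,
# NOT the project's regex).
SIMPLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Valid addresses that a simple pattern tends to miss.
VALID_BUT_UNUSUAL = [
    '"aaaa"@example.com',
    "john.smith(comment)@example.com",
    "jsmith@[192.168.2.1]",
    "jsmith@[IPv6:2001:db8::1]",
]

# Invalid addresses that a simple pattern tends to accept.
INVALID_BUT_PLAUSIBLE = [
    "John..Doe@example.com",
    ".JohnDoe@example.com",
    "JohnDoe.@example.com",
]


def is_detected(address: str, pattern: re.Pattern = SIMPLE_EMAIL) -> bool:
    """Return True if the whole string is accepted by the pattern."""
    return pattern.fullmatch(address) is not None


def report(pattern: re.Pattern = SIMPLE_EMAIL) -> dict:
    """Collect missed valid addresses and wrongly accepted invalid ones."""
    return {
        "missed": [a for a in VALID_BUT_UNUSUAL if not is_detected(a, pattern)],
        "wrongly_detected": [
            a for a in INVALID_BUT_PLAUSIBLE if is_detected(a, pattern)
        ],
    }
```

Passing a different compiled pattern to `report` makes it easy to compare a candidate regex against the current one on the same cases.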
@liyongsea I looked at the docs for both git-secrets and detect-secrets. I’ve made a summary of both in this document. I think they are really good starting points for key detection.
TLDR:
Git-secrets has 3 regexes to detect the AWS ID, key and account; they use fixed prefixes to reduce false positives. We can add them to our regex list.
In this paper the authors detect AWS keys too, with similar regexes but without prefixes. To reduce false positives, they add a heuristic filter on top: keep only instances where a match for the Client ID and a match for the Secret Key appear within 5 lines of each other. They report 100% precision and 97% recall.
Detect-secrets: it’s very interesting because it has many plugins (like AWSKeyDetector, AzureDetector...), which means we can test and select the best detectors, or take the regexes from each detector.
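As a rough illustration of the two AWS ideas in the TLDR, here is a minimal sketch that combines a prefixed key ID regex with a line-proximity filter. The prefix list is modeled on the git-secrets pattern and should be checked against upstream; the secret key pattern, the function name `find_aws_pairs` and the assumption that the filter keeps only ID/secret pairs within 5 lines of each other are illustrative choices, not the paper's or the library's code.

```python
import re

# Access key ID with fixed prefixes (modeled on git-secrets; verify upstream).
AWS_KEY_ID = re.compile(
    r"\b(?:A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}\b"
)

# Candidate secret key: exactly 40 base64-like characters (assumed pattern).
AWS_SECRET = re.compile(r"(?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=])")


def find_aws_pairs(text: str, max_line_distance: int = 5) -> list:
    """Return (key_id, secret) pairs found close to each other.

    Each element is ((id_line, id_value), (secret_line, secret_value)),
    with 1-based line numbers. A key ID or secret candidate that has no
    counterpart within `max_line_distance` lines is dropped.
    """
    ids, secrets = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        ids += [(lineno, m.group()) for m in AWS_KEY_ID.finditer(line)]
        secrets += [(lineno, m.group()) for m in AWS_SECRET.finditer(line)]
    return [
        (key_id, secret)
        for key_id in ids
        for secret in secrets
        if abs(key_id[0] - secret[0]) <= max_line_distance
    ]
```

The proximity filter trades some recall for precision: an isolated 40-character string (a hash, for instance) is ignored unless a key ID appears nearby.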
@terryyz Would be happy to consider these cases if there’s a way to cover them without decreasing the accuracy of the regex. Although I think most of them are unlikely to be present in the dataset, e.g. I didn’t know many of them were allowed in emails, and I’m not sure most platforms accept them.
EDIT: The email regex used for BigScience was updated: https://regex101.com/r/uRlGkP/1. We can test it against the current one; it seems more robust.
@liyongsea I made another notebook for the analysis of detect-secrets to follow up on what you started, see this PR. I ran some tests on a dataset of 1k samples from different languages (from the annotation round) with the default plugins, and then removed some of them; there are analysis paragraphs in the notebooks. What’s left to do:
Pipeline: set up one pipeline (preferably with consistent outputs) with
Tests and fixes:
@loubnabnl Thank you for the analysis! I updated my PR to integrate your suggestion. I also cleaned up a bit and removed the argument suffix (in the beginning I thought it impacted the output...)
There is one minor problem when you try to locate the secret with str.index. Sometimes detect-secrets finds something like password=password, which might cause problems when trying to redact. I will see how to improve this.
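To make the str.index problem concrete, here is a small sketch. Both helper names are hypothetical, and the workaround (restricting the search to the reported line and taking the last occurrence) is only one possible heuristic, assuming the detector reports a 1-based line number and that the secret value follows the key name on that line.

```python
MASK = "<REDACTED>"


def redact_naive(text: str, secret: str) -> str:
    """Redact the first occurrence found by str.index (the problematic way)."""
    start = text.index(secret)
    return text[:start] + MASK + text[start + len(secret):]


def redact_on_line(text: str, secret: str, line_number: int) -> str:
    """Redact the last occurrence of `secret` on the given 1-based line."""
    lines = text.splitlines(keepends=True)
    line = lines[line_number - 1]
    # rfind picks the right-hand side of something like `password=password`.
    start = line.rfind(secret)
    if start == -1:
        return text  # secret not on that line; leave the text unchanged
    lines[line_number - 1] = line[:start] + MASK + line[start + len(secret):]
    return "".join(lines)


sample = "user=admin\npassword=password\n"
print(redact_naive(sample, "password"))       # masks the key name, not the value
print(redact_on_line(sample, "password", 2))  # masks the value
```

The heuristic still fails if the secret also appears later on the same line, so keeping character offsets from the detector, when available, would be more reliable.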
I updated my PR to add the PII pipeline; now only the anonymization is missing.
I also added an analysis and evaluation notebook with some observations about our regexes' behavior (the evaluation is still WIP):
Thank you @loubnabnl! After looking at the FNs, I spotted some quick wins for detect-secrets:
sample 44 60 81
Interesting, I'll take a deeper look at the results
Test and update regexes for detecting the following entities in The Stack