bigcode-project / bigcode-dataset

Apache License 2.0

TF-PII Redaction regexes #17

Closed (loubnabnl closed 1 year ago)

loubnabnl commented 1 year ago

Test and update regexes for detecting the following entities in The Stack

liyongsea commented 1 year ago

Hi, here is an update on the SSH key detection part:

Todo:

liyongsea commented 1 year ago

Here is a WIP PR, in case you want to play with the detect-secrets library: https://github.com/bigcode-project/bigcode-analysis/pull/24/files

terryyz commented 1 year ago

Based on a question raised on StackOverflow and the description on Wikipedia, here is a summary of edge cases that the current email regex

can't detect:

wrongly detects:

Let me know if you'd like to consider these cases.

loubnabnl commented 1 year ago

@liyongsea I looked at the docs for both git-secrets and detect-secrets, and I’ve written a summary of both in this document. I think they are really good starting points for key detection. TL;DR:

loubnabnl commented 1 year ago

@terryyz I would be happy to consider these cases if there’s a way to cover them without decreasing the accuracy of the regex. That said, I think most of them are unlikely to be present in the dataset. For example, I didn’t know many of them were allowed in emails, and I’m not sure most platforms accept them.

EDIT: The email regex used for BigScience was updated: https://regex101.com/r/uRlGkP/1. It seems more robust, so we can test it against the current one.
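As a rough illustration of what testing one regex against another could look like, here is a minimal sketch. The two patterns and the labelled samples are hypothetical stand-ins made up for this example; they are not the regexes from the regex101 link or from the current pipeline, and `evaluate` is an illustrative helper, not an existing function in the repo.

```python
import re

# Hypothetical stand-in patterns, NOT the regexes discussed in this thread.
CURRENT = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CANDIDATE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

# (text, contains_email) pairs; made-up examples for illustration only.
SAMPLES = [
    ("contact: jane.doe@example.com", True),
    ("reply to user+tag@mail.example.org", True),
    ("version = pkg@1.2.3", False),  # package pin, not an email
    ("no address here", False),
]


def evaluate(pattern, samples):
    """Count true/false positives and negatives at the sample level."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for text, has_email in samples:
        detected = pattern.search(text) is not None
        if detected and has_email:
            counts["tp"] += 1
        elif detected and not has_email:
            counts["fp"] += 1
        elif not detected and has_email:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


if __name__ == "__main__":
    for name, pattern in (("current", CURRENT), ("candidate", CANDIDATE)):
        print(name, evaluate(pattern, SAMPLES))
```

With real annotated samples in place of `SAMPLES`, the same counts would give a quick side-by-side view of which regex produces more false positives or misses.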

loubnabnl commented 1 year ago

@liyongsea I made another notebook for the analysis of detect-secrets to follow up on what you started, see this PR. I ran some tests on a dataset of 1k samples from different languages (from the annotation round) with the default plugins, and then removed some of them; there are analysis paragraphs in the notebook. What’s left to do:

liyongsea commented 1 year ago

@loubnabnl Thank you for the analysis! I updated my PR to integrate your suggestion. I also cleaned up a bit and removed the argument suffix (in the beginning I thought it impacted the output...). There is one minor problem when you try to locate the secret with str.index: sometimes detect-secrets finds something like password=password, where the secret value also appears as the variable name, which might cause problems when trying to redact. I will see how to improve this.
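To make the str.index problem concrete, here is a small sketch using only the standard library. The function names, the `<SECRET>` placeholder, and the "search to the right of the first `=` or `:`" heuristic are assumptions for illustration; this is not what the PR implements and not part of the detect-secrets API.

```python
import re

PLACEHOLDER = "<SECRET>"  # hypothetical placeholder token


def redact_first_match(line: str, secret: str) -> str:
    """Naive redaction: replace the first occurrence found by str.index."""
    start = line.index(secret)
    return line[:start] + PLACEHOLDER + line[start + len(secret):]


def redact_value_side(line: str, secret: str) -> str:
    """Heuristic: prefer an occurrence to the right of the first '=' or ':'."""
    sep = re.search(r"[=:]", line)
    offset = sep.end() if sep else 0
    start = line.find(secret, offset)
    if start == -1:
        # No occurrence after the separator: fall back to the first match.
        start = line.index(secret)
    return line[:start] + PLACEHOLDER + line[start + len(secret):]


if __name__ == "__main__":
    line = "password=password"
    print(redact_first_match(line, "password"))  # redacts the variable name
    print(redact_value_side(line, "password"))   # redacts the assigned value
```

On `password=password`, the naive version replaces the variable name and leaves the value in place, while the heuristic replaces the value. The heuristic is only a stopgap: a more reliable fix would likely need the detector to report character offsets rather than just the secret string.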

loubnabnl commented 1 year ago

I updated my PR to add the PII pipeline, now only the anonymization is missing.

I also added an analysis and evaluation notebook with some observations about our regexes' behavior (the evaluation is still WIP):

liyongsea commented 1 year ago

Thank you @loubnabnl! After looking at the false negatives (FN), I spotted some quick wins for detect-secrets:

loubnabnl commented 1 year ago

Interesting, I'll take a deeper look at the results