bigcode-project / bigcode-dataset

Apache License 2.0

TF-PII Redaction Benchmark #18

Closed loubnabnl closed 1 year ago

loubnabnl commented 2 years ago

Build a benchmark (for the short term) for detecting PII: emails, IP addresses, and SSH & API keys.

loubnabnl commented 2 years ago

Found this paper on detecting AWS credentials in Java files; we can test something similar. They report 100% recall and 91% precision, but the test set is small, dates from 2015, and is not publicly available.
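For reference, here is a minimal sketch of how precision and recall could be computed when testing something similar. It assumes detections and annotations are exact-match `(start, end, label)` spans; the span format, the `KEY` label, and the function name `precision_recall` are illustrative assumptions, not taken from the paper or from this repo.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over exact-match (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of reported spans that are real secrets.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real secrets that were reported.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (made up): two annotated keys, three reported spans.
gold = [(0, 20, "KEY"), (40, 60, "KEY")]
predicted = [(0, 20, "KEY"), (40, 60, "KEY"), (80, 100, "KEY")]
precision, recall = precision_recall(predicted, gold)
```

In this toy case every annotated key is found, so recall is perfect, while the one spurious span lowers precision. Exact-match scoring is strict; partial-overlap matching would be a different (more lenient) choice.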

loubnabnl commented 2 years ago

Here is the benchmark of 400 samples that we annotated for emails, IP addresses, SSH & API keys, names, and usernames.
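As a rough idea of a baseline that could be scored against these annotations, here is a small regex sketch covering only emails and IPv4 addresses. The patterns, the labels `EMAIL` and `IP_ADDRESS`, and the function `detect_pii` are assumptions for illustration; they are not the detectors used in this repo, and keys, names, and usernames would need other approaches.

```python
import ipaddress
import re

# Deliberately simple patterns; real code contains many edge cases these miss.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def detect_pii(text):
    """Return sorted (start, end, label) spans for emails and IPv4 addresses."""
    spans = []
    for match in EMAIL_RE.finditer(text):
        spans.append((match.start(), match.end(), "EMAIL"))
    for match in IPV4_RE.finditer(text):
        try:
            # Reject dotted numbers that are not valid addresses, e.g. 999.1.1.1.
            ipaddress.IPv4Address(match.group())
        except ValueError:
            continue
        spans.append((match.start(), match.end(), "IP_ADDRESS"))
    return sorted(spans)


sample = 'contact = "jane.doe@example.com"  # server at 192.168.0.1, build 999.1.1.1'
detections = detect_pii(sample)
found = [(sample[start:end], label) for start, end, label in detections]
```

Here `found` holds the matched strings with their labels, which makes it easy to eyeball false positives such as version numbers that look like IP addresses.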