Review and improve regex rules

Yelp / detect-secrets

An enterprise friendly way of detecting and preventing secrets in code.

Apache License 2.0

Review and improve regex rules #159

Open domanchi opened 5 years ago

domanchi commented 5 years ago

There was a recent white paper released (summary, source).

The most interesting part is on page 15, where they list a variety of explicit regexes that we may be able to incorporate into our scanning. I think we already cover about 80% of them (mostly with the high entropy scanner), but there are some interesting ones worth extracting, e.g.:

finance related tokens
Facebook access tokens

We should go through this list and create new plugins for the ones that we're missing.
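As a rough illustration of what such regex-based plugins could look like, here is a minimal standalone sketch. It deliberately does not use the detect-secrets plugin API; `PATTERNS`, `Finding`, and `scan_lines` are hypothetical names, and the two regexes are assumed token shapes for illustration only, so they should be checked against the paper's list before being relied on.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumed token shapes, for illustration only. Verify against the paper's
# table before turning any of these into a real plugin.
PATTERNS = {
    "Facebook Access Token": re.compile(r"EAACEdEose0cBA[0-9A-Za-z]+"),
    "Stripe Standard API Key": re.compile(r"sk_live_[0-9a-zA-Z]{24}"),
}


class Finding(NamedTuple):
    secret_type: str
    line_number: int
    value: str


def scan_lines(lines: Iterable[str]) -> Iterator[Finding]:
    """Yield one Finding per regex match, with 1-indexed line numbers."""
    for line_number, line in enumerate(lines, start=1):
        for secret_type, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                yield Finding(secret_type, line_number, match.group(0))
```

Because each pattern is keyed by a secret type, a match also tells you what kind of token was found, which a pure entropy check cannot do.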

killuazhu commented 5 years ago

I love the idea. Being able to identify the type of a token more deterministically can also support #153

domanchi commented 5 years ago

A couple of notes from this paper worth mentioning (for posterity):

Section III, Part E: discusses some interesting ideas on how to better filter out junk keys (e.g. keys like XXXX, or keys that have EXAMPLE in the text).
Section V, Part D: notes that for multi-factor secrets (e.g. username and password), there is an 80% chance that both parts can be found within 5 lines of context before or after a secret.
Section VII, Part D: entropy checks still catch more secrets than regex rules alone. This is good to know, and allows users to decide how conservative they want to be (accuracy vs. recall trade-off).
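To make the first two notes concrete, here is a small sketch of a junk-key filter and a 5-line context window. This is not the paper's method or existing detect-secrets code: `looks_like_junk`, `context_window`, and `PLACEHOLDER_WORDS` are hypothetical names, and every placeholder word other than EXAMPLE is an assumption.

```python
import re
from typing import List, Tuple

# "EXAMPLE" comes from the note above; the other words are assumptions.
PLACEHOLDER_WORDS = ("EXAMPLE", "SAMPLE", "DUMMY")


def looks_like_junk(candidate: str) -> bool:
    """Crude heuristic: flag placeholder words or one character repeated 4+ times."""
    upper = candidate.upper()
    if any(word in upper for word in PLACEHOLDER_WORDS):
        return True
    # Matches runs such as "XXXX" or "0000".
    return re.search(r"(.)\1{3,}", candidate) is not None


def context_window(
    lines: List[str], secret_line: int, radius: int = 5
) -> List[Tuple[int, str]]:
    """Return (line_number, text) pairs within `radius` lines of a 1-indexed line."""
    start = max(1, secret_line - radius)
    end = min(len(lines), secret_line + radius)
    return [(n, lines[n - 1]) for n in range(start, end + 1)]
```

A scanner could run `looks_like_junk` on each candidate before reporting it, and search the `context_window` around a detected secret for its companion factor (e.g. a username near a password). The repeated-character rule is aggressive and would also reject real secrets that happen to contain such a run, so it trades recall for fewer false positives.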

KevinHock commented 5 years ago

I thought this part was another cool thing to experiment with:

Section III, Part D:

Note that each regular expression was prefixed with negative lookbehind (?<![\w]) and suffixed with negative lookahead (?![\w]) to ensure that no word characters appeared before or after the regular expression match and improve accuracy.
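The quoted technique can be tried out with a short sketch like the one below. `with_word_boundaries` is a hypothetical helper name, and the `sk_live_` pattern is only an assumed token shape used to show the effect.

```python
import re


def with_word_boundaries(pattern: str) -> re.Pattern:
    """Wrap a pattern so no word character may appear directly before or after a match."""
    return re.compile(r"(?<![\w])(?:" + pattern + r")(?![\w])")


# Assumed token shape, for illustration only.
TOKEN = r"sk_live_[0-9a-zA-Z]{24}"

raw = re.compile(TOKEN)
guarded = with_word_boundaries(TOKEN)

fake_key = "sk_live_" + "a" * 24

# A standalone key is matched by both versions.
standalone = 'key = "' + fake_key + '"'

# Embedded in a longer identifier, only the unguarded regex matches.
embedded = "risk_live_" + "a" * 24
too_long = fake_key + "bbbb"
```

Here `raw.search(embedded)` and `raw.search(too_long)` both find a match inside the longer string, while `guarded` rejects them because a word character sits next to the candidate. The non-capturing group `(?:...)` keeps the lookarounds applied to the whole pattern even if it contains top-level alternation.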