Yelp / detect-secrets

An enterprise friendly way of detecting and preventing secrets in code.
Apache License 2.0
3.75k stars 466 forks source link

Improve false negative ratio by detecting keys with hyphens #49

Closed domanchi closed 4 months ago

domanchi commented 6 years ago

Certain API keys use hyphens.

e.g. blahblah-aaaa-bbbb-cccc-ddddddd

This currently is not caught by the suite of HighEntropyStringPlugins.

KevinHock commented 6 years ago

There are also some AKIAblahblah keys that do not have high-entropy enough, but I suppose that should be a different issue :)

mohit-surana commented 5 years ago

I wanted to work on this issue. On doing a little bit of digging, I came up with two proposed solutions and would really appreciate any comments regarding the same:

1) Can we expect the client to include a hyphen within the charset? If yes, then I believe we just need to use re.escape(charset) instead of charset on this line 2) If clients should continue to use the existing charset, we need to either enrich just the regex, or both the regex and charset (internally). Unless we append the hyphen to the charset in the constructor, the entropy calculation will not use hyphens. So should hyphens be included in the entropy calculation?

Finally, I believe tests should go here, right? Thanks!

domanchi commented 5 years ago

Those are good questions @mohit-surana! The short answers are:

  1. No (we should be able to support both, holistically)
  2. Yes?

Based on the entropy algorithm, it seems that the more characters in the charset, the higher the entropy can be.

Following this logic, it would suggest that a more liberal charset may require a different entropy configuration level, seeing that the same level may produce more false positives.

However, if this is true, then any additions to the charset would require a completely separate plugin (e.g. adding hyphens and percentage signs -%), and the maintenance of these potential plugins could get very messy.

Any thoughts on this?

mohit-surana commented 5 years ago

Theoretically, yes. It would increase the false positive rate while reducing false negative rates as well. Ultimately it will be a trade-off between false positive and false negatives. Do we have any statistics regarding the current system's false positive rates?

How can we design good tests to measure the new statistics, that have a large coverage to assess the new FP/FN rates?

As for new plugins, it seems to me that ultimately, making changes to the entropy calculation is a big NO as it may affect current clients. And you can make combinatorial number of plugins if we make one for each small difference. Would it be better to allow clients to pass an additional argument indicating whether they would like to include additional preset/client specified symbols?

Bottom line: If FP increases a lot, we need to have the client make a conscious decision to move into a new version that supports hyphens.

domanchi commented 5 years ago

I'm in favor of the additional argument, but I don't know how that might look like with the user interface. Certainly would increase the scope of this issue (and perhaps no longer a "good first issue")! If you still wanted to take it on, we'd more than welcome the contribution!

Otherwise, the AKIA prefixed issue that @KevinHock mentioned may be a good start. Though it doesn't strictly find the AWS secret, it gives a good indication that there might be a secret there, in the same principle as "where there's smoke, there's fire".

As for testing FP/FN rates, we are building a large internal collection of various different secrets that we use to experiment with our new plugins. We can certainly run your plugin on our corpus, and help tweak its default sensitivity.

mohit-surana commented 5 years ago

Hey @domanchi. I am interested in implementing the additional argument version of the solution. I am a bit caught up with stuff at the moment and I'll get back to it as soon as time permits!

The internal corpus sounds like a really good idea, and in general will help attract more users as well. As for the AKIA prefix, I will need to think further to understand how we can incorporate patterns along with the entropy calculation. Let me get back to you!

lorenzodb1 commented 4 months ago

We're going to close this issue as it hasn't received any update in a very long time. Feel free to re-open it if you think it's still relevant.