Yelp / detect-secrets

An enterprise friendly way of detecting and preventing secrets in code.
Apache License 2.0
3.58k stars 449 forks source link

Fix check bypass via square brackets Issue #857 #860

Open tsigouris007 opened 1 week ago

tsigouris007 commented 1 week ago

Fixes behavior on missed secrets that have square brackets before their default value. Example:

"<%= ENV['CLIENT_ACCESS_KEY_ID'].presence || 'AKIA123456789ABCDEF1' %>"

Issue: https://github.com/Yelp/detect-secrets/issues/857

Escapes the square brackets and catches secrets properly.

No breaking changes.

tsigouris007 commented 1 week ago

Hi @tsigouris007, thank you for your PR!

Could you explain what the bug was?

Hi, You can visit the issue for the compete write-up.

lorenzodb1 commented 1 week ago

Hi @tsigouris007, #857 describes the issue, I was looking for an explanation around what's causing said issue in the code

tsigouris007 commented 1 week ago

Hi @tsigouris007, #857 describes the issue, I was looking for an explanation around what's causing said issue in the code

I haven't quite had the time to find the exact location that brakes this. The combined opening/closing square brackets [] seem to be causing the lines to not be fetched at all for checking.

For example I got AWSKeyDetector's code into a separate standalone python file and the following secret was caught properly by the regex:

access_key_id: "<%= ENV['CLIENT_ACCESS_KEY_ID'].presence || 'AKIA123456789ABCDEF1' %>"

I am taking wild a guess here and I most probably think the problem is how the checked file lines are fetched (and that's why I thought of "escaping" the square brackets earlier). As far as I can help atm you can go to detect_secrets/core/scan.py method _process_line_based_plugins and the following condition is returning True and is quitting early:

if _is_filtered_out(
            required_filter_parameters=['line'],
            filename=filename,
            line=line,
            context=code_snippet,
):

If you remove it or add a pass instead of continue these secrets are caught because it hits the else clause in _is_filtered_out loop returning a debug message of:

Skipping secret due to `detect_secrets.filters.heuristic.is_indirect_reference`.

I can take a closer look but it will take me at least a couple of days.

lorenzodb1 commented 1 week ago

Understanding what exactly is causing the bug will help us finding a good solution moving forward rather than a workaround that could potentially add another set of issues, hence why I'm asking this. Take your time and please let me know what you'll find!

tsigouris007 commented 1 week ago

Understanding what exactly is causing the bug will help us finding a good solution moving forward rather than a workaround that could potentially add another set of issues, hence why I'm asking this. Take your time and please let me know what you'll find!

It's understandable. I 'll try to get back to this asap.

tsigouris007 commented 1 week ago

Hi @lorenzodb1 , so I found where the problem occurs. In file detect_secrets/filters/heuristic.py there is _get_indirect_reference_regex with a regex of:

return re.compile(r'([^\v=!:]*)\s*(:=?|[!=]{1,3})\s*([\w.-]+[\[\(][^\v]*[\]\)])')

So the choice is to either change this regex a bit or escape earlier the square brackets. The latter could be done as is in my PR or we could do it also in is_indirect_reference function inside the heuristic.py file right before the boolean return as shown:

def is_indirect_reference(line: str) -> bool:
    """
    Filters secrets that take the form of:

        secret = get_secret_key()

    or

        secret = request.headers['apikey']
    """
    # Constrain line length as the heuristic's intention is to target lines that resemble
    # function calls. The constraint avoids catastrophic backtracking failures of the regex.
    if len(line) > 1000:
        return False
    line = line.replace('[', '\[').replace(']', '\]')
    return bool(_get_indirect_reference_regex().search(line))

This produces the correct results.

After playing around with the regex I came up with the following that seems to be working (tested on a huge repo with ~1.2k secrets):

def _get_indirect_reference_regex() -> Pattern:
    # Regex details:
    #   ([^\v=!:]*)     ->  Something before the assignment or comparison
    #   \s*             ->  Some optional whitespaces
    #   (:=?|[!=]{1,3}) ->  Assignment or comparison: :=, =, ==, ===, !=, !==
    #   \s*             ->  Some optional whitespaces
    #   (
    #       [\w.-]+     ->  Some alphanumeric character, dot or -
    #       [\[\(]      ->  Start of indirect reference: [ or (
    #       [^\v]*      ->  Something except line breaks
    #       [\]\)]      ->  End of indirect reference: ] or )
    #   )
    # return re.compile(r'([^\v=!:]*)\s*(:=?|[!=]{1,3})\s*([\w.-]+[\[\(][^\v]*[\]\)])') # Left the old one for reference
    return re.compile(r'([^\v=!:"<%>]*)\s*(:=?|[!=]{1,3})\s*([\w.-]+[\[\(][^\v]*[\]\)])')

How do you want to go through with this? I have no strong opinion.