Yelp / detect-secrets

An enterprise friendly way of detecting and preventing secrets in code.
Apache License 2.0

detect-secrets not identifying all Github token occurrences in a file #858

Open karamuz opened 1 week ago

karamuz commented 1 week ago

For example, given the file test_ghp.txt:

GITHUB_USERID=11111111
PERSONAL_ACCESS_TOKEN=ghp_ab123cDEfGhiz1UabC1cDfGhIj4KlM1NO1P1
GITHUB_USERID=99999999
PERSONAL_ACCESS_TOKEN=ghp_Zx123yDEfGhij9UvW5xCdEfGhIj7MnO4PR2Q

When I scan the file, I get these results:

  "results": {
    "test_ghp.txt": [
      {
        "type": "GitHub Token",
        "filename": "fast.txt",
        "hashed_secret": "e175c6f5f2a92e8623bd9a4820edb4e8c1b0fd10",
        "is_verified": false,
        "line_number": 2
      }
    ]
  },
  "generated_at": "2024-06-20T12:54:36Z"

As referenced in #493, if the same secret is written into a file at multiple locations, detect-secrets only identifies the first one. The problem here is that multiple GitHub tokens with different values in the same file are still interpreted as if they were the same secret.

In the regular expression used here:

(ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}

There is one capturing group: (ghp|gho|ghu|ghs|ghr). This group matches and captures the prefix part of a GitHub token.

Because of this capturing group, when findall() processes a string matching this pattern, it does not return the entire match ("ghp_...36 characters..."). Instead, it returns only the part of the match that corresponds to the capturing group, which would be "ghp", "gho", etc., depending on the token. In the test file above, both tokens start with "ghp", so both yield the same string.

Example: If you were to run findall() on a string like "Test ghp_abc123...", given the regex above, the output would be:

['ghp'] # Instead of ['ghp_abc123...']

This output occurs because findall() returns only the content of the capturing group, rather than the entire match.
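To make this concrete, here is a minimal standalone sketch using only Python's re module and the two token lines from test_ghp.txt. It is not the detect-secrets code itself; the names TOKEN_RE, lines, and prefixes are just for illustration, and the pattern is the one quoted above.

```python
import re

# Pattern as quoted in this issue: one capturing group for the prefix.
TOKEN_RE = re.compile(r'(ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}')

# The two token lines from test_ghp.txt.
lines = [
    'PERSONAL_ACCESS_TOKEN=ghp_ab123cDEfGhiz1UabC1cDfGhIj4KlM1NO1P1',
    'PERSONAL_ACCESS_TOKEN=ghp_Zx123yDEfGhij9UvW5xCdEfGhIj7MnO4PR2Q',
]

# With exactly one capturing group, findall() returns the group's text,
# not the whole match, so both lines produce the same value.
prefixes = [TOKEN_RE.findall(line) for line in lines]
print(prefixes)  # [['ghp'], ['ghp']]
```

Both lines produce the same string even though the tokens differ, which would be consistent with only one result being reported for the file.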

The expected behavior would be to capture all the different secrets in a file.

In the analyze_string function, using finditer() instead of findall() might solve the issue, since it ensures that the entire matching string is retrieved:

for match in regex.finditer(string):
    yield match.group(0) # Returns the entire matched string

finditer() yields match objects from which you can extract specific groups or the entire match (via match.group(0)), providing flexibility and precision in handling regex matches.
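As a rough check of this idea outside of detect-secrets, the sketch below runs both approaches on the test file contents. The helper iter_full_matches and the NON_CAPTURING pattern are hypothetical names for illustration, not part of the library. The non-capturing variant (?:...) is an alternative I am assuming would also work, since findall() returns whole matches when the pattern has no capturing groups.

```python
import re
from typing import Iterator, Pattern

# Pattern as quoted in this issue (capturing group around the prefix).
CAPTURING = re.compile(r'(ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}')
# Hypothetical alternative: same pattern with a non-capturing group.
NON_CAPTURING = re.compile(r'(?:ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}')


def iter_full_matches(regex: Pattern[str], string: str) -> Iterator[str]:
    """Yield the entire matched text for every match in the string."""
    for match in regex.finditer(string):
        yield match.group(0)  # group(0) is always the whole match


text = "\n".join([
    'GITHUB_USERID=11111111',
    'PERSONAL_ACCESS_TOKEN=ghp_ab123cDEfGhiz1UabC1cDfGhIj4KlM1NO1P1',
    'GITHUB_USERID=99999999',
    'PERSONAL_ACCESS_TOKEN=ghp_Zx123yDEfGhij9UvW5xCdEfGhIj7MnO4PR2Q',
])

full = list(iter_full_matches(CAPTURING, text))
print(full)                         # two distinct full tokens
print(CAPTURING.findall(text))      # ['ghp', 'ghp']
print(NON_CAPTURING.findall(text))  # same two full tokens as finditer()
```

Either change keeps the two tokens distinct. Whether the rest of the plugin code relies on the current findall() behavior is something the maintainers would need to confirm.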