
xGitGuard

AI-Based Secrets Detection
Detect Secrets (API Tokens, Usernames, Passwords, etc.) Exposed on GitHub Repositories
Designed and Developed by Comcast Cybersecurity Research and Development Team



Contents

Overview

xGitGuard Workflow

Features

Credential Detection Workflow

Keys & Token Detection Workflow

Install

Environment Setup

Search Patterns

Usage

Enterprise GitHub Secrets Detection

API Configuration Setup

Running Enterprise Secret Detection

Enterprise Credential Secrets Detection

Detections Without Additional ML Filter

By default, the Credential Secrets Detection script runs for the given Secondary Keywords and extensions without the ML filter.

# Run with Default configs
python enterprise_cred_detections.py
Detections With ML Filter

xGitGuard also provides an additional ML filter, where users can collect their organization/targeted data and train their own model. Using this ML filter helps reduce false positives in the detections.

Pre-Requisite To Use the ML Filter

The user needs to follow the process below to collect data and train the model before using the ML filter.

NOTE :

  • To use the ML filter, ML training is mandatory. This includes data collection, feature engineering, and model persisting.
  • This process depends on user requirements: it can be a one-time effort, or repeated periodically if the user needs to improve the training data.
Command to Run Enterprise Credential Scanner with ML
# Run for given Secondary Keyword and extension with ML model,
python enterprise_cred_detections.py -m Yes
Command to Run Enterprise Credentials Scanner for targeted organization
# Run for targeted org,
python enterprise_cred_detections.py -o org_name        #Ex: python enterprise_cred_detections.py -o test_org
Command to Run Enterprise Credentials Scanner for targeted repo
# Run for targeted repo,
python enterprise_cred_detections.py -r org_name/repo_name     #Ex: python enterprise_cred_detections.py -r test_org/public_docker
Command-Line Arguments for Credential Scanner
Run usage:
enterprise_cred_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]

optional arguments:
  -h, --help            show this help message and exit
  -s Secondary Keywords, --secondary_keywords Secondary Keywords
                          Pass the Secondary Keywords list as a comma-separated string
  -e Extensions, --extensions Extensions
                          Pass the Extensions list as a comma-separated string
  -m ML Prediction, --ml_prediction ML Prediction
                          Pass the ML Filter as Yes or No. Default is No
  -u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
                          Pass the flag as Yes or No. Default is No
  -o pass org name, --org Pass the targeted org list as a comma-separated string
  -r pass repo name, --repo Pass the targeted repo list as a comma-separated string
  -l Logger Level, --log_level Logger Level
                          Pass the Logging level as CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
  -c Console Logging, --console_logging Console Logging
                          Pass the Console Logging as Yes or No. Default is Yes
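
The arguments above can be combined in a single run. A hedged example with hypothetical keyword, extension, and org values (the ML filter still requires a trained model):

# Scan a targeted org with custom secondary keywords and extensions, ML filter enabled, DEBUG logging
python enterprise_cred_detections.py -s "password,secret" -e "py,yml" -m Yes -o test_org -l 10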

Enterprise Keys and Tokens Secrets Detection

Detections Without Additional ML Filter

By default, the Keys and Tokens Secrets Detection script runs for the given Secondary Keywords and extensions without the ML filter.

# Run with Default configs
python enterprise_key_detections.py
Command to Run Enterprise Keys and Tokens Scanner for targeted organization
# Run for targeted org,
python enterprise_key_detections.py -o org_name        #Ex: python enterprise_key_detections.py -o test_org
Command to Run Enterprise Keys and Tokens Scanner for targeted repo
# Run for targeted repo,
python enterprise_key_detections.py -r org_name/repo_name     #Ex: python enterprise_key_detections.py -r test_org/public_docker
Detections With ML Filter

xGitGuard also provides an additional ML filter, where users can collect their organization/targeted data and train their own model. Using this ML filter helps reduce false positives in the detections.

Pre-Requisite To Use ML Feature

The user needs to follow the process below to collect data and train the model before using the ML filter.

NOTE :

  • To use the ML filter, ML training is mandatory. It includes data collection, feature engineering, and model persisting.
  • This process depends on user requirements: it can be a one-time effort, or repeated periodically if the user needs to improve the training data.
Command to Run Enterprise Keys & Token Scanner with ML
# Run for given Secondary Keyword and extension with ML model
python enterprise_key_detections.py -m Yes
Command-Line Arguments for Keys & Token Scanner
Run usage:
enterprise_key_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]

optional arguments:
  -h, --help            show this help message and exit
  -s Secondary Keywords, --secondary_keywords Secondary Keywords
                          Pass the Secondary Keywords list as a comma-separated string
  -e Extensions, --extensions Extensions
                          Pass the Extensions list as a comma-separated string
  -m ML Prediction, --ml_prediction ML Prediction
                          Pass the ML Filter as Yes or No. Default is No
  -u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
                          Pass the flag as Yes or No. Default is No
  -o pass org name, --org Pass the targeted org list as a comma-separated string
  -r pass repo name, --repo Pass the targeted repo list as a comma-separated string
  -l Logger Level, --log_level Logger Level
                          Pass the Logging level as CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
  -c Console Logging, --console_logging Console Logging
                          Pass the Console Logging as Yes or No. Default is Yes
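
As with the credential scanner, these arguments can be combined. A sketch with hypothetical values, restricting the scan to a single repo and turning console logging off:

# Scan a targeted repo with custom secondary keywords and extensions, logging to file only
python enterprise_key_detections.py -s "api_key,token" -e "py,json" -r test_org/public_docker -c No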

Enterprise Output Format:

Output Files

Public GitHub Secrets Detection

Configuration Data Setup

Running Public Credential Secrets Detection

Note: The user needs to remove the sample content from primary_keywords.csv and add primary keywords, such as targeted domain names, to be searched in public GitHub.
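
For illustration only, assuming a simple one-keyword-per-line layout, primary_keywords.csv could list targeted domain names such as the hypothetical entries below (refer to the sample file shipped with xGitGuard for the exact header and format):

example.com
example-corp.net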

Public Credential Secrets Detection

Detections Without Additional ML Filter

By default, the Credential Secrets Detection script runs for the given Primary Keywords, Secondary Keywords, and extensions without the ML filter.

# Run with Default configs
python public_cred_detections.py
Command to Run Public Credential Scanner for targeted organization
# Run for targeted org,
python public_cred_detections.py -o org_name         #Ex: python public_cred_detections.py -o test_org
Command to Run Public Credential Scanner for targeted repo
# Run for targeted repo,
python public_cred_detections.py -r org_name/repo_name        #Ex: python public_cred_detections.py -r test_org/public_docker
Detections With ML Filter

xGitGuard also provides an additional ML filter, where users can collect their organization/targeted data and train their own model. Using this ML filter helps reduce false positives in the detections.

Pre-Requisite To Use ML Feature

The user needs to follow the process below to collect data and train the model before using the ML filter.

NOTE :

  • To use the ML feature, ML training is mandatory. It includes data collection, feature engineering, and model persisting.
Command to Run Public Credential Scanner with ML
# Run for given Primary Keyword, Secondary Keyword, and extension with ML model
python public_cred_detections.py -m Yes
Command-Line Arguments for Public Credential Scanner
Run usage:
public_cred_detections.py [-h] [-p Primary Keywords] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]

optional arguments:
  -h, --help            show this help message and exit
  -p Primary Keywords, --primary_keywords Primary Keywords
                          Pass the Primary Keywords list as a comma-separated string
  -s Secondary Keywords, --secondary_keywords Secondary Keywords
                          Pass the Secondary Keywords list as a comma-separated string
  -e Extensions, --extensions Extensions
                          Pass the Extensions list as a comma-separated string
  -m ML Prediction, --ml_prediction ML Prediction
                          Pass the ML Filter as Yes or No. Default is No
  -u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
                          Pass the flag as Yes or No. Default is No
  -o pass org name, --org Pass the targeted org list as a comma-separated string
  -r pass repo name, --repo Pass the targeted repo list as a comma-separated string
  -l Logger Level, --log_level Logger Level
                          Pass the Logging level as CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
  -c Console Logging, --console_logging Console Logging
                          Pass the Console Logging as Yes or No. Default is Yes
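
A combined example with hypothetical primary keyword, secondary keyword, and extension values; a real scan should substitute the user's own targeted domains:

# Scan public GitHub for a targeted domain with the ML filter enabled
python public_cred_detections.py -p "example.com" -s "password" -e "py" -m Yes
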
Public Keys and Tokens Secrets Detection
Detections Without Additional ML Filter

By default, the Keys and Tokens Secrets Detection script runs for the given Primary Keywords, Secondary Keywords, and extensions without the ML filter.

# Run with Default configs
python public_key_detections.py
Command to Run Public Keys and Tokens Scanner for targeted organization
# Run for targeted org,
python public_key_detections.py -o org_name           #Ex: python public_key_detections.py -o test_org
Command to Run Public Keys and Tokens Scanner for targeted repo
# Run for targeted repo,
python public_key_detections.py -r org_name/repo_name      #Ex: python public_key_detections.py -r test_org/public_docker
Detections With ML Filter

xGitGuard also provides an additional ML filter, where users can collect their organization/targeted data and train their own model. Using this ML filter helps reduce false positives in the detections.

Pre-Requisite To Use ML Feature

The user needs to follow the process below to collect data and train the model before using the ML filter.

NOTE: To use the ML feature, ML training is mandatory. It includes data collection, feature engineering, and model persisting.

Command to Run Public Keys & Tokens Secret Scanner with ML
# Run for given Primary Keyword, Secondary Keyword, and extension with ML model,
python public_key_detections.py -m Yes
Command-Line Arguments for Public Keys & Tokens Secret Scanner
Run usage:
public_key_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]

optional arguments:
  -h, --help            show this help message and exit
  -s Secondary Keywords, --secondary_keywords Secondary Keywords
                          Pass the Secondary Keywords list as a comma-separated string
  -e Extensions, --extensions Extensions
                          Pass the Extensions list as a comma-separated string
  -m ML Prediction, --ml_prediction ML Prediction
                          Pass the ML Filter as Yes or No. Default is No
  -u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
                          Pass the flag as Yes or No. Default is No
  -o pass org name, --org Pass the targeted org list as a comma-separated string
  -r pass repo name, --repo Pass the targeted repo list as a comma-separated string
  -l Logger Level, --log_level Logger Level
                          Pass the Logging level as CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
  -c Console Logging, --console_logging Console Logging
                          Pass the Console Logging as Yes or No. Default is Yes
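
For example, restricting a public keys and tokens scan to a single repository with console logging turned off; the values shown are hypothetical:

# Scan a targeted repo for keys and tokens, logging to file only
python public_key_detections.py -s "token" -e "json" -r test_org/public_docker -c No
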
Public Output Files

Note: By default, the detected secrets are masked to hide sensitive data. If needed, the user can skip the masking and write the raw secret using the command-line argument -u Yes or --unmask_secret Yes. Refer to the command-line options for more details.
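
For instance, to write the detected secrets unmasked during a public keys and tokens scan (handle the resulting output files carefully):

# Write detected secrets unmasked instead of masked
python public_key_detections.py -u Yes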

ML Model Training

Enterprise ML Model Training Procedure

To use the ML feature, ML training is mandatory. It includes data collection, feature engineering, and model persisting.

Note: Labeling the collected secrets is an important step in improving the ML prediction.

Data Collection

Traverse into the "data collector" folder under ml_training

  cd ml_training\ml_data_collector\github-enterprise-ml-data-collector
Review & Label the Collected Data
  1. By default, all collected data will be labeled as 1 under the "Label" column in the collected training data, indicating that the collected secret is a valid one.
  2. The user needs to review each row of the collected data and update the label value, i.e., if the user decides the collected data is not a secret, change the value to 0 for that particular row.
  3. This gives the ML model quality data and helps reduce false positives.
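
A minimal sketch of this review step using pandas, assuming the collected training data is a CSV file with a "Label" column; the file name and row indices below are hypothetical and should be replaced with the user's own collected data:

import pandas as pd

# Hypothetical path to the collected training data CSV
data_file = "cred_train_source_data.csv"

df = pd.read_csv(data_file)

# Inspect the collected rows before deciding on labels
print(df.head())

# Mark rows that are not real secrets with 0 (row indices shown are examples)
false_positive_rows = [3, 7]
df.loc[false_positive_rows, "Label"] = 0

# Persist the reviewed labels for feature engineering and model training
df.to_csv(data_file, index=False)
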
Feature Engineering

Traverse into the "ml_training" folder

ML Model Creation for Enterprise

Traverse into the "ml_training" folder

Public GitHub ML Model Training Procedure

To use the ML feature, ML training is mandatory. It includes data collection, feature engineering, and model persisting.

Note: Labeling the collected secrets is an important step in using the ML effectively.

Data Collection

Traverse into the "data collector" folder

cd ml_training\ml_data_collector\github-public-ml-data-collector

Note: The user needs to remove the sample content from primary_keywords.csv and add primary keywords, such as targeted domain names, to be searched in public GitHub.

Note: The data collection for public GitHub is optional.

  • If the targeted data collected from the Enterprise scan is sufficient, the data collection and label review process can be skipped.
Review & Label the Collected Data
  1. By default, all collected data will be labeled as 1 under the "Label" column in the collected training data, indicating that the collected secret is a valid one.
  2. The user needs to review each row of the collected data and update the label value, i.e., if the user decides the collected data is not a secret, change the value to 0 for that particular row.
  3. This gives the ML model quality data and helps reduce false positives.

Note: Labeling the collected secrets is an important step in using the ML effectively.

Feature Engineering

Traverse into the "ml_training" folder

Note:

  • Data collection and feature engineering for the public GitHub scan are optional.
  • When public training data is not available, feature engineering will use the enterprise source data.
ML Model Creation for Public GitHub

Traverse into the "ml_training" folder

Custom Keyword Scan

Running Enterprise Keyword Search

Enterprise Custom Keyword Search Process

Please add the required keywords to be searched into config/enterprise_keywords.csv

# Run with given configs,
python enterprise_keyword_search.py
Command to Run Enterprise Scanner for targeted organization
# Run for targeted org,
python enterprise_keyword_search.py -o org_name             #Ex: python enterprise_keyword_search.py -o test_ccs
Command to Run Enterprise Scanner for targeted repo
# Run for targeted repo,
python enterprise_keyword_search.py -r org_name/repo_name         #Ex: python enterprise_keyword_search.py -r test_ccs/ccs_repo_1
Command-Line Arguments for Enterprise Keyword Scanner
Run usage:
enterprise_keyword_search.py [-h] [-e Enterprise Keywords]  [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]

optional arguments:
  -h, --help            show this help message and exit
  -e Enterprise Keywords, --enterprise_keywords Enterprise Keywords
                          Pass the Enterprise Keywords list as a comma-separated string. This is an optional argument. Keywords can also be provided in the `enterprise_keywords.csv` file located in the `config` directory.
  -o pass org name, --org Pass the targeted org list as a comma-separated string
  -r pass repo name, --repo Pass the targeted repo list as a comma-separated string
  -l Logger Level, --log_level Logger Level
                          Pass the Logging level as CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
  -c Console Logging, --console_logging Console Logging
                          Pass the Console Logging as Yes or No. Default is Yes
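
A hedged example passing keywords directly on the command line instead of through the CSV file; the keyword and org values are hypothetical:

# Search a targeted org for specific keywords
python enterprise_keyword_search.py -e "internal_api,staging_host" -o test_ccs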

Running Public Keyword Search

Public Custom Keyword Search Process

Please add the required keywords to be searched into config/public_keywords.csv

# Run with given configs,
python public_keyword_search.py
Command to Run Public Scanner for targeted organization
# Run for targeted org,
python public_keyword_search.py -o org_name                 #Ex: python public_keyword_search.py -o test_org
Command to Run Public Scanner for targeted repo
# Run for targeted repo,
python public_keyword_search.py -r org_name/repo_name         #Ex: python public_keyword_search.py -r test_org/public_docker
Command-Line Arguments for Public Keyword Scanner
Run usage:
public_keyword_search.py [-h] [-p Public Keywords]  [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]

optional arguments:
  -h, --help            show this help message and exit
  -p Public Keywords, --public_keywords Public Keywords
                          Pass the Public Keywords list as a comma-separated string. This is an optional argument. Keywords can also be provided in the `public_keywords.csv` file located in the `config` directory.
  -o pass org name, --org Pass the targeted org list as a comma-separated string
  -r pass repo name, --repo Pass the targeted repo list as a comma-separated string
  -l Logger Level, --log_level Logger Level
                          Pass the Logging level as CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
  -c Console Logging, --console_logging Console Logging
                          Pass the Console Logging as Yes or No. Default is Yes
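
Similarly for public GitHub, a sketch with hypothetical keyword and repo values, using the long --public_keywords option:

# Search a targeted repo for specific keywords
python public_keyword_search.py --public_keywords "example.com,example_service" -r test_org/public_docker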

Additional Important Notes

License

Licensed under the Apache 2.0 license.