AI-Based Secrets Detection
Detect Secrets (API Tokens, Usernames, Passwords, etc.) Exposed on GitHub Repositories
Designed and Developed by Comcast Cybersecurity Research and Development Team
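Enterprise Credential Secrets Detection - Run Secret detection on the GitHub Enterprise account
Public Credential Secrets Detection - Run Secret detection on the GitHub Public account
Enterprise Keys and Tokens Secrets Detection - Run Secret detection on the GitHub Enterprise account
Public Keys and Tokens Secrets Detection - Run Secret detection on the GitHub Public account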
Install Python >= v3.6
Clone/Download the repository from GitHub
Traverse into the cloned xGitGuard folder
cd xGitGuard
Install Python Dependency Packages
python -m pip install -r requirements.txt
Check for Outdated Packages
pip list --outdated
There are two ways to define configurations in xGitGuard
For Enterprise GitHub Detection (Secondary Keyword + Extension), under the config directory
For Public GitHub Detection (Primary Keyword + Secondary Keyword + Extension), under the config directory
GITHUB_ENTERPRISE_TOKEN - Enterprise GitHub API Token with full scopes of repository and user.
Set your Enterprise Name in the URLs below, in the config file xgg_configs.yaml in the config data folder xgitguard\config\*:
https://github.<<Enterprise_Name>>.com/api/v3/search/code
https://github.<<Enterprise_Name>>.com/api/v3/repos/
https://github.<<Enterprise_Name>>.com/api/v3/search/code
https://github.<<Enterprise_Name>>.com/api/v3/repos/{user_name}/{repo_name}/commits?path={file_path}
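The placeholder substitution described above can be sketched as follows. This is only an illustration: the helper name `fill_enterprise_name` and the enterprise name `example` are hypothetical, and the real values are edited by hand in xgg_configs.yaml.

```python
# Hypothetical helper, not part of xGitGuard: replaces the
# <<Enterprise_Name>> placeholder in the URL templates from this section.
PLACEHOLDER = "<<Enterprise_Name>>"

URL_TEMPLATES = [
    "https://github.<<Enterprise_Name>>.com/api/v3/search/code",
    "https://github.<<Enterprise_Name>>.com/api/v3/repos/",
    "https://github.<<Enterprise_Name>>.com/api/v3/repos/"
    "{user_name}/{repo_name}/commits?path={file_path}",
]


def fill_enterprise_name(template: str, enterprise_name: str) -> str:
    """Return the template with the placeholder replaced by the given name."""
    if not enterprise_name:
        raise ValueError("enterprise_name must not be empty")
    return template.replace(PLACEHOLDER, enterprise_name)


if __name__ == "__main__":
    # "example" is an assumed enterprise name used only for illustration.
    for template in URL_TEMPLATES:
        print(fill_enterprise_name(template, "example"))
```

The `{user_name}`, `{repo_name}`, and `{file_path}` fields are left untouched, since they appear to be filled in per search result rather than per installation.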
Traverse into the github-enterprise script folder
cd github-enterprise
By default, the Credential Secrets Detection script runs for given Secondary Keywords and extensions without ML Filter.
# Run with Default configs
python enterprise_cred_detections.py
xGitGuard also has an additional ML filter, for which users can collect their organization/targeted data and train their own model. This ML filter helps reduce false positives in the detection.
Users need to follow the process below to collect data and train the model before using the ML filter.
NOTE:
- To use the ML filter, ML training is mandatory. This includes data collection, feature engineering, and model persisting.
- How often this process is repeated depends on user requirements. It can be done once, or periodically if the training data needs to be improved.
# Run for given Secondary Keyword and extension with ML model,
python enterprise_cred_detections.py -m Yes
# Run for targeted org,
python enterprise_cred_detections.py -o org_name #Ex: python enterprise_cred_detections.py -o test_org
# Run for targeted repo,
python enterprise_cred_detections.py -r org_name/repo_name #Ex: python enterprise_cred_detections.py -r test_org/public_docker
Run usage:
enterprise_cred_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-s Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keywords list as a comma-separated string
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is No
-u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
Pass the flag as Yes or No. Default is No
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
Inputs used for search and scan
Note: Command-line argument keywords have precedence over config files (Default). If no keywords are passed in cli, data from config files will be used for the search.
GitHub search pattern for above examples: password +extension:py
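To make the search pattern above concrete, here is a minimal sketch that builds query strings in the same `keyword +extension:ext` shape. The function name `build_search_queries` is hypothetical, and pairing every keyword with every extension is an assumption about how the inputs are combined, not a description of xGitGuard's internals.

```python
from itertools import product
from typing import Iterable, List


def build_search_queries(secondary_keywords: Iterable[str],
                         extensions: Iterable[str]) -> List[str]:
    """Pair each keyword with each extension in the documented pattern."""
    return [
        f"{keyword} +extension:{extension}"
        for keyword, extension in product(secondary_keywords, extensions)
    ]


if __name__ == "__main__":
    # Inputs mirror the example in this section.
    for query in build_search_queries(["password"], ["py"]):
        print(query)
```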
By default, the Keys and Tokens Secrets Detection script runs for given Secondary Keywords and the extensions without ML Filter.
# Run with Default configs
python enterprise_key_detections.py
# Run for targeted org,
python enterprise_key_detections.py -o org_name #Ex: python enterprise_key_detections.py -o test_org
# Run for targeted repo,
python enterprise_key_detections.py -r org_name/repo_name #Ex: python enterprise_key_detections.py -r test_org/public_docker
xGitGuard also has an additional ML filter where users can collect their organization/targeted data and train their model. Having this ML filter helps in reducing the false positives from the detection.
The user needs to follow the below process to collect data and train the model to use ML filter.
NOTE :
- To use ML filter, ML training is mandatory. It includes data collection, feature engineering & model persisting.
- How often this process is repeated depends on user requirements. It can be done once, or periodically if the training data needs to be improved.
# Run for given Secondary Keyword and extension with ML model
python enterprise_key_detections.py -m Yes
Run usage:
enterprise_key_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-s Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keywords list as a comma-separated string
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is No
-u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
Pass the flag as Yes or No. Default is No
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
Inputs used for search and scan
Note: Command-line argument keywords have precedence over config files (Default). If no keywords are passed in cli, data from the config files will be used for search.
GitHub search pattern for above examples: api_key +extension:py
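The precedence rule in the note above (command-line keywords first, config file values otherwise) can be sketched as below. The function name `resolve_keywords` and the sample values are hypothetical; how xGitGuard actually parses its arguments is not shown in this document.

```python
from typing import List, Optional


def resolve_keywords(cli_value: Optional[str],
                     config_values: List[str]) -> List[str]:
    """Prefer the comma-separated CLI string; fall back to config values."""
    if cli_value:
        # Split on commas and drop empty or whitespace-only entries.
        parsed = [item.strip() for item in cli_value.split(",") if item.strip()]
        if parsed:
            return parsed
    return list(config_values)


if __name__ == "__main__":
    print(resolve_keywords("api_key, token", ["secret"]))
    print(resolve_keywords(None, ["secret"]))
```

An empty or blank command-line string is treated the same as a missing one here, so the config values still apply.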
Credentials
1. Hashed Url Files: xgitguard\output\*_enterprise_hashed_url_creds.csv
- List of previously processed search URLs. Stored URLs are skipped in the next run to avoid reprocessing.
2. Secrets Detected: xgitguard\output\*_xgg_enterprise_creds_detected.csv
3. Log File: xgitguard\logs\enterprise_key_detections_*yyyymmdd_hhmmss*.log
Keys & Tokens
1. Hashed Url Files: xgitguard\output\*_enterprise_hashed_url_keys.csv
- List of previously processed search URLs. Stored URLs are skipped in the next run to avoid reprocessing.
2. Secrets Detected: xgitguard\output\*_xgg_enterprise_keys_detected.csv
3. Log File: xgitguard\logs\enterprise_key_detections_*yyyymmdd_hhmmss*.log
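The hashed URL files let a later run skip search URLs that were already processed. A minimal sketch of that idea follows. It keeps the hashes in an in-memory set rather than a CSV file, and the choice of SHA-256 and the function names are assumptions made for illustration only.

```python
import hashlib
from typing import Iterable, List, Set


def hash_url(url: str) -> str:
    """Hash a URL. SHA-256 is an assumed choice for this sketch."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def filter_new_urls(urls: Iterable[str], seen_hashes: Set[str]) -> List[str]:
    """Return URLs not seen before and record their hashes in seen_hashes."""
    new_urls = []
    for url in urls:
        digest = hash_url(url)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_urls.append(url)
    return new_urls


if __name__ == "__main__":
    seen: Set[str] = set()
    print(filter_new_urls(["https://a/1", "https://a/2"], seen))
    # The second call only returns the URL that was not seen in the first.
    print(filter_new_urls(["https://a/2", "https://a/3"], seen))
```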
GITHUB_TOKEN - Public GitHub API Token with full scopes of the repository and user.
Config data folder: xgitguard\config\*
Traverse into the github-public script folder
cd github-public
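For readers wiring the GITHUB_TOKEN environment variable into their own tooling, the sketch below reads the token and builds an authorization header for the GitHub API. The function name `github_auth_headers` is hypothetical and this is not xGitGuard's own code. Passing the environment in as a mapping is a design choice that keeps the function easy to test.

```python
import os
from typing import Dict, Mapping


def github_auth_headers(env: Mapping[str, str] = os.environ,
                        variable: str = "GITHUB_TOKEN") -> Dict[str, str]:
    """Build an Authorization header from the given environment variable."""
    token = env.get(variable, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {variable} is not set")
    # GitHub accepts personal access tokens with the "token" scheme.
    return {"Authorization": f"token {token}"}


if __name__ == "__main__":
    try:
        headers = github_auth_headers()
        print("Header keys:", list(headers))  # never print the token itself
    except RuntimeError as error:
        print(error)
```

The same helper works for the enterprise scripts by passing `variable="GITHUB_ENTERPRISE_TOKEN"`.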
Note: Users need to remove the sample content from primary_keywords.csv and add their own primary keywords, such as targeted domain names, to be searched in public GitHub.
By default, Credential Secrets Detection script runs for given Primary Keyword, Secondary Keyword, and extension without ML Filter.
# Run with Default configs
python public_cred_detections.py
# Run for targeted org,
python public_cred_detections.py -o org_name #Ex: python public_cred_detections.py -o test_org
# Run for targeted repo,
python public_cred_detections.py -r org_name/repo_name #Ex: python public_cred_detections.py -r test_org/public_docker
xGitGuard also has an additional ML filter, where users can collect their organization/targeted data and train their model. Having this ML filter helps in reducing the false positives from the detection.
The user needs to follow the below process to collect data and train the model to use ML filter.
NOTE :
- To use ML Feature, ML training is mandatory. It includes data collection, feature engineering & model persisting.
# Run for given Primary Keyword, Secondary Keyword, and extension with ML model
python public_cred_detections.py -m Yes
Run usage:
usage: public_cred_detections.py [-h] [-p Primary Keywords] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-p Primary Keywords, --primary_keywords Primary Keywords
Pass the Primary Keywords list as a comma-separated string
-s Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keywords list as a comma-separated string
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is No
-u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
Pass the flag as Yes or No. Default is No
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
Inputs used for search and scan
Note: Command-line argument keywords have precedence over config files (Default). If no keywords are passed on the command line, data from the config files will be used for the search.
GitHub search pattern for above examples: abc.xyz.com password +extension:py
By default, Keys and Tokens Secret Detection script runs for given Primary Keyword, Secondary Keyword and extension without ML Filter.
# Run with Default configs
python public_key_detections.py
# Run for targeted org,
python public_key_detections.py -o org_name #Ex: python public_key_detections.py -o test_org
# Run for targeted repo,
python public_key_detections.py -r org_name/repo_name #Ex: python public_key_detections.py -r test_org/public_docker
xGitGuard also has an additional ML filter, where users can collect their organization/targeted data and train their model. Having this ML filter helps in reducing the false positives from the detection.
The user needs to follow the below process to collect data and train the model to use ML filter.
NOTE: To use the ML filter, ML training is mandatory. It includes data collection, feature engineering, and model persisting.
# Run for given Primary Keyword, Secondary Keyword, and extension with ML model,
python public_key_detections.py -m Yes
usage:
public_key_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-s Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keywords list as a comma-separated string
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is No
-u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
Pass the flag as Yes or No. Default is No
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
Inputs used for search and scan
Note: Command-line argument keywords have precedence over config files (Default). If no keywords are passed on the command line, data from the config files will be used for the search.
GitHub search pattern for above examples: abc.xyz.com api_key +extension:py
Credentials
1. Hashed Url Files: xgitguard\output\*_public_hashed_url_creds.csv
- List of previously processed search URLs. Stored URLs are skipped in the next run to avoid reprocessing.
2. Secrets Detected: xgitguard\output\*_xgg_public_creds_detected.csv
3. Log File: xgitguard\logs\public_key_detections_*yyyymmdd_hhmmss*.log
Keys & Tokens
1. Hashed Url Files: xgitguard\output\*_public_hashed_url_keys.csv
- List of previously processed search URLs. Stored URLs are skipped in the next run to avoid reprocessing.
2. Secrets Detected: xgitguard\output\*_xgg_public_keys_detected.csv
3. Log File: xgitguard\logs\public_key_detections_*yyyymmdd_hhmmss*.log
Note: By default, detected secrets are masked to hide sensitive data. If needed, users can skip the masking and write the raw secret by passing the command-line argument -u Yes or --unmask_secret Yes. Refer to the command-line options for more details.
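As an illustration of what masking a detected secret can look like, here is a minimal sketch. The masking scheme (keep the first four characters, replace the rest with `#`) and the function name `mask_secret` are assumptions. The exact format xGitGuard writes to its output files is not specified in this document.

```python
def mask_secret(secret: str, visible: int = 4, mask_char: str = "#") -> str:
    """Keep the first `visible` characters and mask the rest."""
    if visible < 0:
        raise ValueError("visible must be non-negative")
    if len(secret) <= visible:
        # Short values are masked entirely so nothing useful leaks.
        return mask_char * len(secret)
    return secret[:visible] + mask_char * (len(secret) - visible)


if __name__ == "__main__":
    print(mask_secret("hunter2password"))
```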
To use ML Feature, ML training is mandatory. It includes data collection, feature engineering & model persisting.
Note: Labelling the collected secret is an important process to improve the ML prediction.
Traverse into the "ml_training" folder
cd ml_training
Traverse into the "data collector" folder under ml_training
cd ml_data_collector\github-enterprise-ml-data-collector
Credentials
python enterprise_cred_data_collector.py
python enterprise_cred_data_collector.py -h
Output file: xgitguard\output\cred_train_source.csv
Keys & Tokens
python enterprise_key_data_collector.py
python enterprise_key_data_collector.py -h
Output file: xgitguard\output\key_train_source.csv
Users need to review each row in the collected data and update the label value.
For example, if a collected row is not a secret, change the label value to 0 for that row.
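Once the non-secret rows have been identified, the label update can also be scripted. The sketch below works on CSV text in memory. The column names `Secret` and `Label` and the function name `relabel_rows` are hypothetical, so check the header of the collected CSV before adapting it.

```python
import csv
import io
from typing import Set


def relabel_rows(csv_text: str, not_secret_rows: Set[int],
                 label_column: str = "Label") -> str:
    """Set the label to 0 for the given zero-based data-row indexes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if label_column not in fieldnames:
        raise KeyError(f"column {label_column!r} not found")
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for index, row in enumerate(reader):
        if index in not_secret_rows:
            row[label_column] = "0"  # reviewer decided this row is not a secret
        writer.writerow(row)
    return output.getvalue()


if __name__ == "__main__":
    # Assumed two-column layout, used only for this example.
    sample = "Secret,Label\nabc123,1\nexample,1\n"
    print(relabel_rows(sample, {1}))
```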
Traverse into the "ml_training" folder
Credentials
python ml_feature_engineering.py cred
Output file: xgitguard\output\cred_train.csv
Keys & Tokens
python ml_feature_engineering.py key
Output file: xgitguard\output\key_train.csv
Traverse into the "ml_training" folder
Run training with Cred Training Data and persist model
python model.py cred
Run training with Key Training Data and persist model
python model.py key
For help on command line arguments, run
python model.py -h
Note: If persisted model xgitguard\output\xgg_*.pickle is not present in the output folder, then use engineered data to create a model and persist it.
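The note above describes a load-or-create pattern: reuse the persisted pickle when it exists, otherwise build a model and persist it. A minimal sketch of that pattern follows. The function name `load_or_train`, the `train` callback, and the file name `xgg_demo.pickle` are hypothetical, and the real training logic lives in model.py.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_train(model_path: Path, train: Callable[[], Any]) -> Any:
    """Load the pickled model if present; otherwise train and persist it."""
    if model_path.exists():
        with model_path.open("rb") as handle:
            return pickle.load(handle)
    model = train()
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as handle:
        pickle.dump(model, handle)
    return model


if __name__ == "__main__":
    # A plain dict stands in for a trained model in this example.
    path = Path("output") / "xgg_demo.pickle"
    print(load_or_train(path, lambda: {"weights": [1, 2, 3]}))
```

Only load pickle files from a trusted source, since unpickling can execute arbitrary code.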
To use ML Feature, ML training is mandatory. It includes data collection, feature engineering & model persisting.
Note: Labelling the collected secret is an important process to use the ML effectively.
Traverse into the "ml_training" folder
cd ml_training
Traverse into the "data collector" folder
cd ml_training\ml_data_collector\github-public-ml-data-collector
Note: Users need to remove the sample content from primary_keywords.csv and add their own primary keywords, such as targeted domain names, to be searched in public GitHub.
Credentials
python public_cred_data_collector.py
python public_cred_data_collector.py -h
Output file: xgitguard\output\public_cred_train_source.csv
Keys & Tokens
python public_key_data_collector.py
python public_key_data_collector.py -h
Output file: xgitguard\output\public_key_train_source.csv
Note: Data collection for public GitHub is optional.
- If the targeted data collected from Enterprise is sufficient, the data collection and label review process can be skipped.
Users need to review each row in the collected data and update the label value.
For example, if a collected row is not a secret, change the label value to 0 for that row.
Note: Labelling the collected secret is an important process to use the ML effectively.
Traverse into the "ml_training" folder
Credentials
python ml_feature_engineering.py cred -s public
Output file: xgitguard\output\public_cred_train.csv
Keys & Tokens
python ml_feature_engineering.py key -s public
Output file: xgitguard\output\public_key_train.csv
Note:
- Data collection and feature engineering for the public GitHub scan are optional.
- When public training data is not available, feature engineering uses the enterprise source data.
Traverse into the "ml_training" folder
Run training with Cred Training Data and persist model with public source data
python model.py cred -s public
Run training with Key Training Data and persist model with public source data
python model.py key -s public
For help on command line arguments, run
python model.py -h
Note:
- If the persisted model xgitguard\output\public_*xgg*.pickle is not present in the output folder, use the feature-engineered data to create a model and persist it.
- By default, when feature-engineered data collected in Public mode is not available, model creation uses the enterprise-based engineered data.
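The fallback described above (use public engineered data when present, enterprise data otherwise) can be sketched as a small file-selection helper. The file names follow the outputs listed in this section. The function name `pick_training_file` is hypothetical, and the actual selection logic in model.py may differ.

```python
from pathlib import Path


def pick_training_file(output_dir: Path, kind: str,
                       source: str = "enterprise") -> Path:
    """Prefer public_<kind>_train.csv for the public source when it exists."""
    enterprise_file = output_dir / f"{kind}_train.csv"
    if source == "public":
        public_file = output_dir / f"public_{kind}_train.csv"
        if public_file.exists():
            return public_file
    # Fall back to the enterprise-based engineered data.
    return enterprise_file


if __name__ == "__main__":
    print(pick_training_file(Path("output"), "cred", source="public"))
```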
Traverse into the custom-keyword-search script folder
cd custom-keyword-search
Please add the required keywords to be searched into config/enterprise_keywords.csv
# Run with given configs,
python enterprise_keyword_search.py
# Run for targeted org,
python enterprise_keyword_search.py -o org_name #Ex: python enterprise_keyword_search.py -o test_ccs
# Run for targeted repo,
python enterprise_keyword_search.py -r org_name/repo_name #Ex: python enterprise_keyword_search.py -r test_ccs/ccs_repo_1
Run usage:
enterprise_keyword_search.py [-h] [-e Enterprise Keywords] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-e Enterprise Keywords, --enterprise_keywords Enterprise Keywords
Pass the Enterprise Keywords list as a comma-separated string. This is an optional argument. Keywords can also be provided in the `enterprise_keywords.csv` file located in the `configs` directory.
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
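The numeric logging levels in the options above match the constants in Python's standard `logging` module, so the `-l` and `-c` options map naturally onto a logger configuration. The sketch below shows one way to do that. The function name `configure_logger` and the logger name `xgg_example` are hypothetical and are not taken from xGitGuard.

```python
import logging


def configure_logger(level: int = 20, console: bool = True) -> logging.Logger:
    """Create a logger with a numeric level and optional console output."""
    logger = logging.getLogger("xgg_example")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    if console:
        logger.addHandler(logging.StreamHandler())
    return logger


if __name__ == "__main__":
    log = configure_logger(level=10)  # 10 corresponds to DEBUG
    log.debug("debug logging enabled")
```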
Please add the required keywords to be searched into config/public_keywords.csv
# Run with given configs,
python public_keyword_search.py
# Run for targeted org,
python public_keyword_search.py -o org_name #Ex: python public_keyword_search.py -o test_org
# Run for targeted repo,
python public_keyword_search.py -r org_name/repo_name #Ex: python public_keyword_search.py -r test_org/public_docker
Run usage:
public_keyword_search.py [-h] [-p Public Keywords] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-e Public Keywords, --public_keywords Public Keywords
Pass the Public Keywords list as a comma-separated string. This is an optional argument. Keywords can also be provided in the `public_keywords.csv` file located in the `configs` directory.
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
Licensed under the Apache 2.0 license.