Anti-Malware-Alliance / secret-harvest

Python project to automate the collection of code snippets with leaked secrets, in order to build a dataset for ML training.
MIT License
2 stars, 0 forks

Initial POC of the tool #1

Closed rothoma2 closed 3 months ago

rothoma2 commented 3 months ago

The purpose of the tool is to help generate a dataset covering two use cases:

  1. Code snippets that contain leaked or exposed secret credentials.
  2. Clean code snippets that contain no leaked or exposed secret credentials.

Datasets of this kind are currently hard to come by, so this one will be made available to researchers.

The tool uses the GitHub API to hunt for repositories that match suspicious search queries, clones them, runs trufflehog3 on each clone, and produces the snippets. Each snippet is 10 lines long and either contains a detected secret or is a clean sample taken from the same repositories.
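The snippet step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names `secret_snippet` and `clean_snippet` are hypothetical, and it assumes the scanner findings have already been reduced to 0-based line numbers within a file.

```python
from typing import List, Optional, Set

SNIPPET_LEN = 10  # snippet length stated in the description above


def secret_snippet(lines: List[str], secret_line: int,
                   length: int = SNIPPET_LEN) -> List[str]:
    """Return up to `length` consecutive lines that include `secret_line` (0-based)."""
    if not 0 <= secret_line < len(lines):
        raise ValueError("secret_line is outside the file")
    # Centre the window on the flagged line, then clamp it to the file bounds.
    start = max(0, min(secret_line - length // 2, len(lines) - length))
    return lines[start:start + length]


def clean_snippet(lines: List[str], secret_lines: Set[int],
                  length: int = SNIPPET_LEN) -> Optional[List[str]]:
    """Return the first `length`-line window with no flagged line, or None."""
    for start in range(0, len(lines) - length + 1):
        if not any(start <= n < start + length for n in secret_lines):
            return lines[start:start + length]
    return None  # file too short, or every window overlaps a finding
```

Taking the clean sample from the same file as the finding keeps both classes close in style and language, which is presumably why the clean samples come from the same repositories.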

A metadata file with data on the trufflehog findings is appended to the result file.
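One way to keep a snippet and its metadata together is a JSON record per snippet. The sketch below is an assumption about the layout, not the tool's real output format: the `SnippetRecord` class and all of its field names are hypothetical, and `finding` simply carries whatever the scanner reported.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class SnippetRecord:
    """Hypothetical record pairing one snippet with its scanner metadata."""
    repo: str                      # repository the snippet came from
    path: str                      # file path inside the repository
    label: str                     # "secret" or "clean"
    snippet: List[str]             # the snippet lines
    finding: Dict[str, Any] = field(default_factory=dict)  # empty for clean samples


def to_json_line(record: SnippetRecord) -> str:
    """Serialise a record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one such line per snippet gives a JSON Lines file that ML tooling can stream without loading the whole dataset.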

rothoma2 commented 3 months ago

There are currently no collaborators, so I am self-approving.