cfpb / clouseau

⚠️ THIS PROJECT IS DEPRECATED ⚠️ Search your repository's git history for undesirable text patterns such as passwords, SSH keys and other personally identifiable information
Creative Commons Zero v1.0 Universal

Support running against a local repo as a git hook #13

Closed marcesher closed 10 years ago

marcesher commented 10 years ago

It'd be useful to run clouseau locally as a pre-commit hook. It would not prevent commits, but it'd give people an opportunity to amend the commit before pushing, should something untoward be found.

marcesher commented 10 years ago

@virtix @dlapiduz

I've got an idea that will address this and #12. It'll be a fairly significant addition to clouseau, but it will not change any existing behavior.

Short version: I want to add the ability to inspect a range, such as the current commit, or all commits that are part of a pull request. This would use git log and git diff instead of git grep. It'd put the results into the same format that we're using with the output of git grep and consequently those results would still feed into the existing clients.

Why not git grep for this?

Because git grep isn't clever enough to inspect just the changes between a range of commits. Rather, it inspects the entire tree. Consequently, the same findings are going to come up, over and over.

How will this work?

I'll write a new function that accepts a range. The default would be HEAD~1 to HEAD. For a pull request, you could pass "master" and "origin/pr/162", for example.

  1. It would use git log to get the commit messages for the range:

git log --pretty=format:"%B" -1 for the current commit

git log master..origin/pr/162 --pretty=format:"%B" to get all commit messages that are part of a pull request

Then, for each term, we would loop over the lines and detect that term's presence. Findings get added to the result data.

  2. For each term from the patterns files, it would use git diff to find diffs containing that term:

git diff HEAD~1 HEAD -Sspecifiied --unified=0 --minimal -w to find the term 'specifiied' in the current commit, which produces output like:

<clouseau> (master)$ git diff HEAD~1 HEAD -Sspecifiied --unified=0 --minimal -w
       diff --git a/clouseau/clouseau.py b/clouseau/clouseau.py
       index d307f15..1b71294 100755
       --- a/clouseau/clouseau.py
       +++ b/clouseau/clouseau.py
       @@ -100 +100 @@ class Clouseau:
       -                        help="If specifiied, skips any calls to git-clone or git-pull.")
       +                        help="If specified, skips any calls to git-clone or git-pull. Useful in combination with --dest to test a local git repo")

git diff origin/pr/162 master -Svalue --unified=0 --minimal -w for a pull request to get a diff of anything containing the word 'value', which produces output like:

diff --git a/resources/static/js/data-api.js b/resources/static/js/data-api.js
index 23b9284..22171cb 100644
--- a/resources/static/js/data-api.js
+++ b/resources/static/js/data-api.js
@@ -5,0 +6 @@
+      .chain()
@@ -12 +13,2 @@
-      }, {});
+      }, {})
+      .value();
@@ -21 +23,3 @@
-    var formString = _(formVals).pairs()
+    var formString = _(formVals)
+      .chain()
+      .pairs()
@@ -25,0 +30 @@
+      .value()
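The commit-message step of the proposal (step 1) could be sketched roughly as below. This is only an illustration, not clouseau's implementation: `commit_messages` and `scan_message` are hypothetical names, and the search terms in the usage note are the ones that appear in the sample output elsewhere in this thread.

```python
import re
import subprocess


def commit_messages(revision_args=("-1",), repo_dir="."):
    """Return raw commit message text from `git log --pretty=format:%B`.

    revision_args is passed straight to git log, e.g. ("-1",) for the
    current commit or ("master..origin/pr/162",) for a range.
    """
    command = ["git", "log", "--pretty=format:%B", *revision_args]
    return subprocess.check_output(command, cwd=repo_dir, text=True)


def scan_message(message, patterns):
    """Return (pattern, line_number, line) for each line matching a pattern."""
    findings = []
    for pattern in patterns:
        regex = re.compile(pattern)
        for line_number, line in enumerate(message.splitlines(), start=1):
            if regex.search(line):
                findings.append((pattern, line_number, line))
    return findings
```

Inside a git repository, `scan_message(commit_messages(), [r"password[ ]*=[ ]*.+"])` would check the current commit's message against one term. Keeping the matching separate from the git call means it can be unit-tested without a repository.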

Ideally, git grep would get us all of this, but I can't see a way to do that. This approach isn't as clean as git grep, and it'll be a fair bit of work to parse these outputs, which is why I wanted to propose it before implementing.
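To give a sense of the parsing work involved, here is a minimal sketch that pulls added lines and their line numbers out of diff output like the samples above. It assumes standard unified diff hunk headers (`@@ -a,b +c,d @@`); `added_lines` and `findings_for_term` are hypothetical helper names, not functions from clouseau.

```python
import re

# Captures the starting line number on the "new" side of a hunk header,
# e.g. "@@ -12 +13,2 @@" -> 13.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(diff_text):
    """Return (path, line_number, text) for every added line in a diff."""
    results = []
    path = None
    new_line = None  # None until the first hunk header of the current file
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            path, new_line = None, None
            continue
        header = HUNK_HEADER.match(line)
        if header:
            new_line = int(header.group(1))
            continue
        if new_line is None:
            # Still in the per-file header; pick up the new-side path.
            if line.startswith("+++ "):
                path = line[4:]
                if path.startswith("b/"):
                    path = path[2:]
            continue
        if line.startswith("+"):
            results.append((path, new_line, line[1:]))
            new_line += 1
        elif line.startswith(" "):
            new_line += 1  # context line (absent with --unified=0)
    return results


def findings_for_term(diff_text, pattern):
    """Keep only the added lines whose text matches the search pattern."""
    regex = re.compile(pattern)
    return [hit for hit in added_lines(diff_text) if regex.search(hit[2])]
```

Because `-S` only selects which files and commits appear in the diff, the added lines still need to be filtered against the term, which is what `findings_for_term` does here. Removed lines are ignored in this sketch, on the assumption that only newly introduced text is of interest.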

marcesher commented 10 years ago

@virtix @dlapiduz I've added support for commit parsing, including a list of commits (which supports the PR use case), to the "parse_commit" branch. There's a really crappy unit test which demonstrates the behavior, though I need to add assertions and make it an actual test.

Sample usage on another project looks like this:

clouseau -u https://github.com/marcesher/cato --skip --dest $(dirname $(pwd)) --revlist="867c3ab 3768636"

which shows how to inspect multiple commits against a local git repo.

Sample output of that command, using the new "thin" client, looks like:

<cato> (master)$ clouseau_thin -u https://github.com/marcesher/cato --skip --dest $(dirname $(pwd)) --revlist="867c3ab 3768636"
Skipping git-clone or git-pull as --skip was found on the command line.
Clouseau: a silly git inspector, searching https://github.com/marcesher/cato

✓  hooktest.txt
Search term:  password[ ]*=[ ]*.+
https://github.com/marcesher/cato/commit/867c3ab938becbafb96d922e0f8134b1f0faf354
Author: Marc Esher <marc.esher@gmail.com> Date:   Tue Feb 25 09:28:34 2014 -0500
Clouseau should flag some stuff here

+My password="ILovePumpkins"  Line:9
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

✓  hooktest.txt
Search term:  [0-9]{3}[\.\-][0-9]{2}[\.\-][0-9]{4}
https://github.com/marcesher/cato/commit/867c3ab938becbafb96d922e0f8134b1f0faf354
Author: Marc Esher <marc.esher@gmail.com> Date:   Tue Feb 25 09:28:34 2014 -0500
Clouseau should flag some stuff here

+My SSN is 173-22-1322  Line:13
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

✓  hooktest2.txt
Search term:  [0-9]{3}[\.\-][0-9]{2}[\.\-][0-9]{4}
https://github.com/marcesher/cato/commit/867c3ab938becbafb96d922e0f8134b1f0faf354
Author: Marc Esher <marc.esher@gmail.com> Date:   Tue Feb 25 09:28:34 2014 -0500
Clouseau should flag some stuff here

+My SSN is 888-99-7654  Line:4
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

✓  Commit Message
Search term:  [0-9]{3}[\.\-][0-9]{2}[\.\-][0-9]{4}
https://github.com/marcesher/cato/commit/3768636cc08d3d3ad21f14c01256e52f082c5ad1
Author: Marc Esher <marc.esher@gmail.com> Date:   Tue Feb 25 09:52:00 2014 -0500
Both file and commit contain changes clouseau should flag. SSN is 167-99-0000

Both file and commit contain changes clouseau should flag. SSN is 167-99-0000  Line:1
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

✓  hooktest2.txt
Search term:  [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
https://github.com/marcesher/cato/commit/867c3ab938becbafb96d922e0f8134b1f0faf354
Author: Marc Esher <marc.esher@gmail.com> Date:   Tue Feb 25 09:28:34 2014 -0500
Clouseau should flag some stuff here

+My IP is 127.2.3.5  Line:6
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

✓  hooktest.txt
Search term:  username[ ]*=[ ]*.+
https://github.com/marcesher/cato/commit/3768636cc08d3d3ad21f14c01256e52f082c5ad1
Author: Marc Esher <marc.esher@gmail.com> Date:   Tue Feb 25 09:52:00 2014 -0500
Both file and commit contain changes clouseau should flag. SSN is 167-99-0000

+My  username="TheKidDontPlay"  Line:15
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

-------------------------------------------------------------------------------
      What do you do when you find something that should't be there?

         https://help.github.com/articles/remove-sensitive-data
-------------------------------------------------------------------------------

marcesher commented 10 years ago

@dlapiduz @virtix Running as a post-commit hook is now supported and documented in the parse_commit branch. Once I add proper unit tests, I'll merge to master.
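The hook documented in the parse_commit branch is not reproduced in this thread. As a rough, hypothetical illustration only, a post-commit hook written in Python could assemble the same command line as the sample usage above. The function names are assumptions, as is reading the repository URL from `remote.origin.url`; the clouseau flags are the ones shown in the sample usage.

```python
import os
import subprocess


def clouseau_command(repo_url, repo_dir, revisions):
    """Build the clouseau invocation from the sample usage in this thread."""
    return [
        "clouseau",
        "-u", repo_url,
        "--skip",  # use the local repo; no git-clone or git-pull
        "--dest", os.path.dirname(os.path.abspath(repo_dir)),
        "--revlist=" + " ".join(revisions),
    ]


def git_output(args, repo_dir="."):
    """Run a git command and return its trimmed stdout."""
    return subprocess.check_output(["git", *args], cwd=repo_dir, text=True).strip()


def run_hook(repo_dir="."):
    """Inspect the commit that was just created (assumed hook entry point)."""
    repo_url = git_output(["config", "--get", "remote.origin.url"], repo_dir)
    head = git_output(["rev-parse", "--short", "HEAD"], repo_dir)
    # git ignores a post-commit hook's exit status, so findings never block
    # the commit; they only give a chance to amend before pushing.
    return subprocess.call(clouseau_command(repo_url, repo_dir, [head]))
```

A script saved as `.git/hooks/post-commit` and made executable would end by calling `run_hook()`.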