BishopFox / GitGot

Semi-automated, feedback-driven tool to rapidly search through troves of public data on GitHub for sensitive secrets.
GNU Lesser General Public License v3.0
1.46k stars, 209 forks

Crash on large collection scraping #7

Dr4s1l closed this issue 5 years ago

Dr4s1l commented 5 years ago

I can reproduce this when searching on the chaturbate domain, after analyzing about 500 results, I think.

./gitgot.py -q "secure.chaturbate.com" -f checks/default.list 
Traceback (most recent call last):
  File "./gitgot.py", line 365, in <module>
    main()
  File "./gitgot.py", line 361, in main
    api_request_loop(state)
  File "./gitgot.py", line 239, in api_request_loop
    if should_parse(repo, state) or stepBack:
  File "./gitgot.py", line 101, in should_parse
    candidate_sig = ssdeep.hash(repo.decoded_content)
  File "/home/yggdrasil/.local/lib/python3.6/site-packages/github/ContentFile.py", line 62, in decoded_content
    assert self.encoding == "base64", "unsupported encoding: %s" % self.encoding
  File "/home/yggdrasil/.local/lib/python3.6/site-packages/github/ContentFile.py", line 82, in encoding
    self._completeIfNotSet(self._encoding)
  File "/home/yggdrasil/.local/lib/python3.6/site-packages/github/GithubObject.py", line 263, in _completeIfNotSet
    self._completeIfNeeded()
  File "/home/yggdrasil/.local/lib/python3.6/site-packages/github/GithubObject.py", line 267, in _completeIfNeeded
    self.__complete()
  File "/home/yggdrasil/.local/lib/python3.6/site-packages/github/GithubObject.py", line 272, in __complete
    self._url.value
  File "/home/yggdrasil/.local/lib/python3.6/site-packages/github/Requester.py", line 275, in requestJsonAndCheck
    return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url)))
  File "/home/yggdrasil/.local/lib/python3.6/site-packages/github/Requester.py", line 286, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {'message': 'Not Found', 'documentation_url': 'https://developer.github.com/v3/repos/contents/#get-contents'}
the-bumble commented 5 years ago

Hi Dr4s1l,

Thanks for the bug report! I just pushed a fix to Master for this.

It was an interesting edge case. Evidently, there are times when files reported by the GitHub API no longer exist on github.com. A file must have been deleted shortly after the initial results listing but before GitGot retrieved its contents. This is now addressed through exception handling.
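For anyone hitting the same thing elsewhere, here is a rough sketch of that kind of exception handling. It is not the actual commit. `UnknownObjectException` below is a local stand-in for the PyGithub exception shown in the traceback, and `FakeContentFile`, `fetch_content_or_none`, and `iter_available` are hypothetical names used only so the example runs without network access.

```python
class UnknownObjectException(Exception):
    """Local stand-in for PyGithub's UnknownObjectException (the 404 above)."""


class FakeContentFile:
    """Hypothetical stand-in for a PyGithub ContentFile search result."""

    def __init__(self, path, content=None):
        self.path = path
        self._content = content  # None simulates a file deleted after listing

    @property
    def decoded_content(self):
        # The real property lazily fetches the file and can fail with a 404.
        if self._content is None:
            raise UnknownObjectException("404 Not Found")
        return self._content


def fetch_content_or_none(result):
    """Return the file contents, or None if the file no longer exists."""
    try:
        return result.decoded_content
    except UnknownObjectException:
        return None


def iter_available(results):
    """Yield (path, content) pairs, skipping results that have vanished."""
    for result in results:
        content = fetch_content_or_none(result)
        if content is None:
            print(f"skipping {result.path}: file no longer exists")
            continue
        yield result.path, content


results = [
    FakeContentFile("a.py", b"x"),
    FakeContentFile("gone.py"),  # deleted between listing and retrieval
    FakeContentFile("b.py", b"y"),
]
available = list(iter_available(results))
```

With real PyGithub objects, the same `try`/`except` would presumably wrap the `repo.decoded_content` access seen in `should_parse`, so one missing file is skipped instead of ending the whole run.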

If you open your latest JSON state file or rate-limited state file, you can manually move the index forward to pick up around where you left off (index:500).
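If you would rather script that than hand-edit the file, something like the sketch below might work. It assumes the state file is a JSON object with a top-level `index` key, which you should verify against your own file first. `set_state_index` and the demo file are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile


def set_state_index(path, new_index):
    """Set the (assumed) top-level "index" key and return the old value."""
    with open(path, "r", encoding="utf-8") as fh:
        state = json.load(fh)
    old_index = state.get("index")
    state["index"] = new_index  # assumption: the resume position lives here
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    return old_index


# Demo on a throwaway file so no real state file is touched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_state.json")
with open(demo_path, "w", encoding="utf-8") as fh:
    json.dump({"index": 0, "query": "example.com"}, fh)

previous = set_state_index(demo_path, 500)
with open(demo_path, "r", encoding="utf-8") as fh:
    updated = json.load(fh)
```

Back up the real state file before changing it, since this rewrites the file in place.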

Happy hunting!