HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Scan all files in a directory or subdirectory #21

Closed troyhunt closed 1 year ago

troyhunt commented 1 year ago

Sometimes a data breach is spread across multiple files, for example multiple .sql files generated by sqlmap. We can already pass multiple explicit paths to this tool, but we also need the ability to index everything in a directory and any subdirectories. This could be done with the same command line args, with an input path that is a directory simply handled differently from one that is a file, but I'll leave that decision up to whoever implements this.
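One way the "same args, handled differently" option could look is sketched below in Python. This is purely illustrative and not the project's code; the function name `expand_inputs` is hypothetical. Each input path is checked, and a directory is expanded into every file beneath it while a plain file is passed through unchanged.

```python
from pathlib import Path
from typing import Iterable, Iterator


def expand_inputs(paths: Iterable[str]) -> Iterator[Path]:
    """Yield every file named by the inputs, recursing into directories."""
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # rglob("*") walks the directory and all of its subdirectories
            yield from sorted(p for p in path.rglob("*") if p.is_file())
        elif path.is_file():
            yield path
        else:
            # Assumption: a missing path is an error rather than silently skipped
            raise FileNotFoundError(f"No such file or directory: {raw}")
```

With this shape the caller never needs to know whether it was given files, directories, or a mix of both.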

In my original implementation, when scanning an entire directory I first generated a report of the total number of files along with their types and sizes, then prompted whether I'd like to proceed, for example (done on all the crap in my temp folder 🤣):

Found 402 files:
.log : 90 files : 3,255,102MB
.json : 47 files : 1,091,043MB
.txt : 32 files : 76,496,578MB
.gcode : 27 files : 44,729MB
.prproj : 27 files : 2,077MB
.XML : 26 files : 44MB
.sample : 24 files : 40MB
.csv : 17 files : 261,774MB
...
Ready?
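A report like the one above could be produced along these lines. This is a rough Python sketch, not the original implementation: the names `summarise`, `print_report` and `confirm` are hypothetical, rows are ordered by file count because that is how the sample appears to be sorted, and sizes are printed in bytes since the unit conversion used in the sample is not known.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, int]  # (extension, file count, total bytes)


def summarise(files: Iterable[Path]) -> List[Row]:
    """Group files by extension and return rows with the most files first."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for path in files:
        entry = totals[path.suffix]
        entry[0] += 1
        entry[1] += path.stat().st_size
    rows = [(ext, count, size) for ext, (count, size) in totals.items()]
    # Assumption: sort by file count; use row[2] instead to sort by total size
    return sorted(rows, key=lambda row: row[1], reverse=True)


def print_report(rows: List[Row]) -> None:
    """Print the summary in roughly the format shown in the sample."""
    print(f"Found {sum(count for _, count, _ in rows)} files:")
    for ext, count, size in rows:
        print(f"{ext or '(none)'} : {count} files : {size:,} bytes")


def confirm(prompt: str = "Ready? ") -> bool:
    """Ask before starting; only an explicit yes proceeds."""
    return input(prompt).strip().lower() in ("y", "yes")
```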

By representing the largest file types first I could get a good idea of how the data is distributed. It then worked from the largest file to the smallest, running pretty much the exact same code we already have in this repo on a file-by-file basis and writing a distinct count from each file to the console (the output we already have is perfect). Each file's addresses were added to a single large collection, and at the end a distinct set of addresses was written from that collection (the same address often appears in multiple files), along with the overall summary to the console.
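The flow described here might be sketched as follows in Python. Everything in it is an assumption made for illustration: the regex is a deliberately simple stand-in and not the project's extraction logic, addresses are lowercased before de-duplication, and each file is read into memory in one go, which a real tool handling large breach files would likely avoid.

```python
import re
from pathlib import Path
from typing import Iterable, Set

# Simplistic stand-in pattern for illustration only; NOT the project's regex
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_from_file(path: Path) -> Set[str]:
    """Return the distinct addresses found in one file."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    # Assumption: addresses are compared case-insensitively
    return {match.lower() for match in EMAIL_PATTERN.findall(text)}


def extract_from_files(files: Iterable[Path]) -> Set[str]:
    """Process files largest first and merge results into one distinct set."""
    all_addresses: Set[str] = set()
    for path in sorted(files, key=lambda p: p.stat().st_size, reverse=True):
        found = extract_from_file(path)
        print(f"{path}: {len(found)} distinct addresses")
        # Set union drops addresses already seen in earlier files
        all_addresses |= found
    print(f"Total: {len(all_addresses)} distinct addresses")
    return all_addresses
```

Because the merge is a set union, the final total can be smaller than the sum of the per-file counts whenever the same address turns up in more than one file.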

One more variable: there should be a list of ignored file types that shouldn't be processed. These can be defined in the app config as they're consistent across executions. For example, here's what I currently have defined (these are all file types I've seen in previous breach corpuses but can't extract addresses from):

      // Archives
      ".tar", ".gz", ".zip", ".rar", ".7z",

      // Images
      ".png", ".tif", ".jpg", ".jpeg", ".gif", ".bmp", ".ai", ".psd", ".svg", ".ico",

      // AV
      ".rec", ".mp3", ".wav", ".mp4", ".mpg", ".mov", ".wmv", ".avi", ".m4v",

      // MySQL
      ".frm", ".ibd", ".myi", ".myd",

      // Code
      ".go", ".py", ".js", ".yml", ".php", ".c", ".sh", ".css", ".less", ".npmignore", ".groovy", ".scala", ".sass", ".ascx", ".markdown", ".bash", ".sln", ".h", ".ts", ".cs", ".aspx", ".csproj", ".nupk", ".suo", ".asax", ".resx", ".refesh", ".ipch",

      // Source control
      ".svn-base", ".gitignore", ".gitattributes", ".pack",

      // Executables
      ".exe", ".dll", ".apk", ".jar", ".java",  ".bin",

      // Other
      ".msi", ".flv", ".swf", ".pdb", ".brd", ".hprof", ".lock", ".docker", ".ttf", ".woff", ".woff2", ".pem", ".crt",

      // Should be able to read these in the future:
      ".xls", ".doc", ".docx", ".ppt", ".pptx", ".pdf", ".rdb"