HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Added directory scanning #24

Closed GStefanowich closed 1 year ago

GStefanowich commented 1 year ago

As requested in #21, I've added directory scanning. If the input path is a directory instead of a file, it will loop over the files in that directory.


I also created a --recursive flag (-r was taken by the report function so I made it explicit) which will enter child directories as well to search for files.
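
As a rough illustration of the behaviour described above (not the code from this PR), the file-versus-directory handling and the opt-in recursion could be sketched as follows. Python is used only for illustration, and the name `iter_files` and its `recursive` parameter are hypothetical.

```python
from pathlib import Path
from typing import Iterator


def iter_files(path: str, recursive: bool = False) -> Iterator[Path]:
    """Yield the path itself if it is a file, or the files inside it if it is a directory.

    Hypothetical helper: the name and signature are assumptions, not the PR's API.
    """
    root = Path(path)
    if root.is_file():
        yield root
    elif root.is_dir():
        # rglob descends into child directories; glob stays at the top level
        entries = root.rglob("*") if recursive else root.glob("*")
        for entry in sorted(entries):
            if entry.is_file():
                yield entry
```

Keeping recursion opt-in mirrors the explicit `--recursive` flag: pointing the tool at a large directory tree does nothing surprising unless the user asks for it.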

I didn't want to try adding a complex function to do the math from bytes to MB, as you have listed in your example:

Found 402 files:
.log : 90 files : 3,255,102MB
.json : 47 files : 1,091,043MB
.txt : 32 files : 76,496,578MB
.gcode : 27 files : 44,729MB
.prproj : 27 files : 2,077MB
.XML : 26 files : 44MB
.sample : 24 files : 40MB
.csv : 17 files : 261,774MB
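
For what it's worth, a bytes-to-MB conversion does not have to be complex. The sketch below is one possible approach, assuming binary units (1 KB = 1024 bytes) and one decimal place; the function name `format_size` is hypothetical and this is not part of the PR.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count using binary units (assumption: 1 KB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB"):
        # Stop at the first unit where the value fits, or at the largest unit
        if size < 1024 or unit == "GB":
            break
        size /= 1024
    if unit == "bytes":
        return f"{num_bytes:,} bytes"
    return f"{size:,.1f} {unit}"
```

For example, `format_size(1536)` gives `1.5 KB`, and small totals such as 159 stay in plain bytes.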

I did also take the "Ready?" from your example and add a prompt before beginning to read the files. It reads "Press ANY KEY to continue. Q to Quit."


I also changed how the CommandLine parser returns its inputted save-paths, storing them as a static reference instead of having to pass them around everywhere for saving. Just a bit of cleanup.

GStefanowich commented 1 year ago

I will also note that since this deals with Files it may need further testing.

I created a method that generates a Set{String} so that if you enter both a directory and a file, and the file is contained within the directory, it will not scan the same file multiple times. File paths are (generally speaking, since partitioning can change this) case-insensitive on Windows and case-sensitive on Unix-based systems. I wrote a simple, crude implementation that checks for the OS instead of something complex that checks the partition.

There are also the occasional oddities, like directory separators being / or \. I'm on Linux, so if something is funky on Windows let me know.
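
To make the de-duplication idea concrete, here is a minimal sketch of the same approach, assuming an OS check stands in for real per-partition detection. The names `path_key` and `dedupe` are hypothetical, and this is an illustration rather than the PR's implementation.

```python
import os
import sys


def path_key(path: str, case_insensitive: bool) -> str:
    """Build the key used to decide whether two paths refer to the same file."""
    # abspath also normalizes: it collapses "." and ".." segments and redundant
    # separators, and on Windows it converts "/" to "\"
    key = os.path.abspath(path)
    return key.lower() if case_insensitive else key


def dedupe(paths, case_insensitive=None):
    """Return paths in their original order with duplicates removed."""
    if case_insensitive is None:
        # Crude OS check (assumption): treat Windows as case-insensitive
        case_insensitive = sys.platform.startswith("win")
    seen = set()
    unique = []
    for p in paths:
        key = path_key(p, case_insensitive)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```

Passing `case_insensitive` explicitly makes the behaviour easy to test on either OS. A case-sensitive volume on Windows, or a case-insensitive one on macOS, would still be misjudged by the OS check, which is the trade-off noted above.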

GStefanowich commented 1 year ago

My test directory is as follows:

- MultipleFiles/
    - File1.txt
    - File2.txt
    - File3.txt
    - Inner/
        - File4.txt
        - something.mp4
        - another.MP4
        - none.none

Will output:

Found 7 files:
.txt: 4 files : 159 bytes
.none: 1 files : 0 bytes, Skipping (Unknown Extension)
.mp4: 2 files : 0 bytes, Skipping (Audio/Video files)
Press ANY KEY to continue. Q to Quit.
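
A summary like the one above can be produced by grouping files on their lower-cased extension, which is why something.mp4 and another.MP4 are counted together. The sketch below is illustrative only: `summarize`, `MEDIA`, and `KNOWN` are hypothetical names, and the extension lists are placeholders, not the project's real lists.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical extension lists, for illustration only
MEDIA = {".mp4", ".mp3", ".avi"}
KNOWN = {".txt", ".log", ".csv", ".json"}


def summarize(files):
    """Build report lines from an iterable of (path, size_in_bytes) pairs."""
    groups = defaultdict(lambda: [0, 0])  # extension -> [file count, total bytes]
    for path, size in files:
        ext = Path(path).suffix.lower()  # ".MP4" and ".mp4" share one group
        groups[ext][0] += 1
        groups[ext][1] += size
    total = sum(count for count, _ in groups.values())
    lines = [f"Found {total} files:"]
    for ext, (count, size) in groups.items():
        line = f"{ext}: {count} files : {size} bytes"
        if ext in MEDIA:
            line += ", Skipping (Audio/Video files)"
        elif ext not in KNOWN:
            line += ", Skipping (Unknown Extension)"
        lines.append(line)
    return lines
```

Groups are reported in the order their extensions are first seen, so the line order depends on the order in which files are discovered.
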

troyhunt commented 1 year ago

Love your work, that looks great! Cursory run seems all fine, late here now so I'll try and run this in anger against a real breach tomorrow.