HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License
64 stars 23 forks source link

AddressExtractor refactoring #49

Closed GStefanowich closed 1 year ago

GStefanowich commented 1 year ago

This is kind of a big change, but I refactored AddressExtractor.cs so it wasn't such a hot path for contributors.

Since extracting email addresses is the bread and butter of the operation, having everyone tweaking it constantly leads to a lot of merge conflicts, already extremely so with a small-ish amount of contributors.

This PR works to change AddressExtractor so that it doesn't need to be directly modified.


I created a class called BaseFilter, which is an abstract base for allowing contributors to create their own address Filters. AddressExtractor still contains the main Regex for capturing what an email is, but filters can do filtering operations where Regex catches false positives.

The filter method implements two Methods:

virtual Result ValidateEmailAddress(EmailAddress address)
    => Result.CONTINUE;

virtual ValueTask<Result> ValidateEmailAddressAsync(EmailAddress address, CancellationToken cancellation = default)
    => ValueTask.FromResult(this.ValidateEmailAddress(address));

If for some reason a Filter needs to run asynchronously it can, as the Async method is what is called by the AddressExtractor. A contributor can override either method to create their filter.


The Address Filter returns a Result Enum:

enum Result
{
    ALLOW,
    CONTINUE,
    DENY
}

ALLOW isn't of much use, but if a Filter wants to declare that a string should immediately succeed, it can do so. CONTINUE means that it passes the filter, and will continue to run through the remaining filters, and DENY fails the string as not a valid address.

I've moved all of the current filters from the for loop into their own filters:


With this change I appear to have broken the test AddressExtractorTests.CorrectNumberOfAddressesAreExtractedFromFile()? It's reading 11 address from SingleSmallFile.txt when it expects 12.

One of the following addresses should succeed but I'm not sure which one?

Failed 'Filter quotes': '"test3@example.com"'
Failed 'Check length': 'teeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee10@example.com'
Failed 'Domain filter': 'test11@192.168'
GStefanowich commented 1 year ago

Fixed the AddressExtractorTests.CorrectNumberOfAddressesAreExtractedFromFile() test, QuotesFilter will now extract test3@example.com out of "test3@example.com"

Also added a REVALIDATE enum option if a Filter wants to restart Filtering

GStefanowich commented 1 year ago

@troyhunt Could you update the EmailAddressInQuotesAreExtracted() test that I've created to make sure every expected address is pulled correctly?

https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/69eb4938b94f86b29382eb269888aada94d8e265/test/AddressExtractorTests.cs#L106-L113

Currently; "test1@example.com" returns test1@example.com ""test2@example.com"" FAILS 'test3@example.com' returns test3@example.com \"test4@example.com\" returns test4@example.com "test5"@example.com returns "test5"@example.com test6"@example.com FAILS

GStefanowich commented 1 year ago

Just added a catch statement while looping over files. Previously the only exception handling was done at the root of the file, so if running an ILineReader for a file extension throws an error trying to read, the entire program exits.

This is a bit of a problem if (currently an issue) your data dumps are taking an hour to process. Hitting an exception halfway through and cancelling the entire thing could be bad.

By default if an exception occurs you'll be prompted with a CONTINUE? with a y/n input, unless of course you've used the -y/--yes flag.

troyhunt commented 1 year ago

@troyhunt Could you update the EmailAddressInQuotesAreExtracted() test that I've created to make sure every expected address is pulled correctly?

Can you clarify what you mean by this @GStefanowich? (I may just be tired and missing something obvious...)

GStefanowich commented 1 year ago

@troyhunt Just looking for some peer review on what should pass the tests.

Everything I've found says that double quotes aren't valid email characters, so "test"@domain.com would technically fail- unless this should resolve to test@domain.com

With the current Regex- putting two sets of double quotes ""test@domain.com"" fails, even though I think test@domain.com should be captured from this? The problem here is that the Regex statement only captures ""test@domain.com", the second double-quote is outside of the bounds of the capture, and the quotes filter tests for matching pairs

Are the quotes part of the address, or should they just be removed?

troyhunt commented 1 year ago

I can't see a legitimate use case for anyone having a double quote in their address and I suggest that if they did, their life would probably be hell anyway! So we're talking about parsing out screwy formatting and yes I agree, I think an address should be successfully extracted from 2 sets of double quotes. In summary, if they're within the address then it fails, but if they encapsulate the address then it passes.