Closed GStefanowich closed 1 year ago
Fixed the AddressExtractorTests.CorrectNumberOfAddressesAreExtractedFromFile()
test, QuotesFilter
will now extract test3@example.com
out of "test3@example.com"
Also added a REVALIDATE
enum option if a Filter wants to restart Filtering
@troyhunt Could you update the EmailAddressInQuotesAreExtracted()
test that I've created to make sure every expected address is pulled correctly?
Currently;
"test1@example.com"
returns test1@example.com
""test2@example.com""
FAILS
'test3@example.com'
returns test3@example.com
\"test4@example.com\"
returns test4@example.com
"test5"@example.com
returns "test5"@example.com
test6"@example.com
FAILS
Just added a catch
statement while looping over files. Previously the only exception handling was done at the root of the file, so if running an ILineReader
for a file extension throws an error trying to read, the entire program exits.
This is a bit of a problem if (currently an issue) your data dumps are taking an hour to process. Hitting an exception halfway through and cancelling the entire thing could be bad.
By default if an exception occurs you'll be prompted with a CONTINUE?
with a y/n
input, unless of course you've used the -y/--yes
flag.
@troyhunt Could you update the
EmailAddressInQuotesAreExtracted()
test that I've created to make sure every expected address is pulled correctly?
Can you clarify what you mean by this @GStefanowich? (I may just be tired and missing something obvious...)
@troyhunt Just looking for some peer review on what should pass the tests.
Everything I've found says that double quotes aren't valid email characters, so "test"@domain.com
would technically fail- unless this should resolve to test@domain.com
With the current Regex- putting two sets of double quotes ""test@domain.com""
fails, even though I think test@domain.com
should be captured from this? The problem here is that the Regex statement only captures ""test@domain.com"
, the second double-quote is outside of the bounds of the capture, and the quotes filter tests for matching pairs
Are the quotes part of the address, or should they just be removed?
I can't see a legitimate use case for anyone having a double quote in their address and I suggest that if they did, their life would probably be hell anyway! So we're talking about parsing out screwy formatting and yes I agree, I think an address should be successfully extracted from 2 sets of double quotes. In summary, if they're within the address then it fails, but if they encapsulate the address then it passes.
This is kind of a big change, but I refactored
AddressExtractor.cs
so it wasn't such a hot path for contributors.Since extracting email addresses is the bread and butter of the operation, having everyone tweaking it constantly leads to a lot of merge conflicts, already extremely so with a small-ish amount of contributors.
This PR works to change
AddressExtractor
so that it doesn't need to be directly modified.I created a class called
BaseFilter
, which is an abstract base for allowing contributors to create their own address Filters.AddressExtractor
still contains the main Regex for capturing what an email is, but filters can do filtering operations where Regex catches false positives.The filter method implements two Methods:
If for some reason a Filter needs to run asynchronously it can, as the Async method is what is called by the
AddressExtractor
. A contributor can override either method to create their filter.The Address Filter returns a
Result
Enum:ALLOW
isn't of much use, but if a Filter wants to declare that a string should immediately succeed, it can do so.CONTINUE
means that it passes the filter, and will continue to run through the remaining filters, andDENY
fails the string as not a valid address.I've moved all of the current filters from the
for
loop into their own filters:With this change I appear to have broken the test
AddressExtractorTests.CorrectNumberOfAddressesAreExtractedFromFile()
? It's reading 11 address fromSingleSmallFile.txt
when it expects 12.One of the following addresses should succeed but I'm not sure which one?