HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License
64 stars 23 forks source link

Replaces '.Replace' calls with a Regex #72

Open GStefanowich opened 5 months ago

GStefanowich commented 5 months ago

I removed the block of .Replace calls from the AddressExtractor class.

Not that this is a corporate environment with a bunch of people editing over eachother, but I wrote the filter system so that there wasn't a spaghetti mess of things overlapping (Including people trying to make overlapping edits). Each filter either accepts or denies an email as valid.


I replaced the block of Replaces with a new Filter, so if an address starts with an illegal character ('!`{#\n\) it'll trim them from the start. (Rather than a full replace).

Then if there was a replace it'll return REVALIDATE which will refilter it against other filters to make sure it's still valid.


With filters if one filter removes an illegal character, eg -, and another filter checks for valid domains, then you might have something where you pass in the email @my-test.dom-ain. The first filter might remove our "illegal" - character which results with @mytest.domain, and then the second filter will deny .domain for being an invalid TLD.


For the remaining test you last added EmailAddressesInExtracted with the email foo|test@example.com|bar, the Filter system needs to be modified because you can only currently return one modified Email address. A filter could see | as an illegal character, run a .Split('|') for ["foo", "test@example.com", "bar"] and then run each through the filter system again, where foo and bar would be skipped.

troyhunt commented 5 months ago

With issue https://github.com/HaveIBeenPwned/EmailAddressExtractor/issues/61 in mind, wouldn't it be better just to treat any character not in the allow list as a terminating string that denotes either the start or the end of the address? I can't think of a downside to this, not given the specific criteria required for the email address pattern that occurs within those characters.

GStefanowich commented 5 months ago

That's ultimately up to you on how you want to handle it.

It'll muddle up the main Regex more, and goes a bit against the design of "Loose main Regex, filter out the bad results" by making the main Regex stricter.

It could perform better with less Regexes, or it could perform better by quickly capturing loose results and only running the "Heavy" Regexes fewer times. I'm not sure and not about to implement both ways to test.