HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Updated email address matching pattern #9

Closed by hiteshbedre 1 year ago

hiteshbedre commented 1 year ago

Handles most of the cases listed in the "LegacyTests.cs" file.

Sample Email Correctly Matched
Mary&Jane@example.org
""test\""blah""@example.com
customer/department@example.com
$A12345@example.com
!def!xyz%abc@example.com
_Yosemite.Sam@example.com
Ima.Fool@example.com
foobar@x.com
foobar@c0m.com
foobar@c_m.com
foo@bar_.com
foo@666.com
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111@example.com

Invalid cases handled:

Invalid Email Address Regex Invalidating?
char)8 + "ar.com
char)9 + "ar.com
char)127 + "ar.com
.wooly@example.com
pootietang.@example.com
.@example.com
foo@bar
foo@bar.a

Patterns not handled by the regex:

Email Description Handled via Code?
pootietang.@example.com dot_before_at
wo..oly@example.com consecutive_dots
foo@bar.1com tld_starting_with_number
foobar@_.com domain_with_underscore
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111@example.com email_of_256_chars
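
As a rough illustration of what handling these cases in code could look like, here is a minimal sketch. It is not the code from this pull request: `PassesCodeChecks` and `MaxLength` are hypothetical names, the 255-character limit is only inferred from the `email_of_256_chars` description, and the underscore rule is narrowed to "domain starts with an underscore" because the sample list above also contains `foobar@c_m.com` and `foo@bar_.com`.

```csharp
using System;

// Hypothetical helper (not part of the repository): returns false for the
// cases listed above as not handled by the regex.
static bool PassesCodeChecks(string address)
{
    // Assumed limit, inferred from the description "email_of_256_chars".
    const int MaxLength = 255;
    if (address.Length > MaxLength)
        return false;

    // Split on the last '@' so that quoted local parts containing '@' stay intact.
    int at = address.LastIndexOf('@');
    if (at <= 0 || at == address.Length - 1)
        return false; // missing local part or missing domain

    string local = address.Substring(0, at);
    string domain = address.Substring(at + 1);

    // dot_before_at: the local part must not end with a dot.
    if (local.EndsWith(".", StringComparison.Ordinal))
        return false;

    // consecutive_dots: ".." is rejected anywhere in the address.
    if (address.Contains(".."))
        return false;

    // domain_with_underscore (assumption: only a leading underscore is rejected).
    if (domain[0] == '_')
        return false;

    // tld_starting_with_number: the label after the last dot must not start with a digit.
    int lastDot = domain.LastIndexOf('.');
    if (lastDot < 0 || lastDot == domain.Length - 1)
        return false;
    if (char.IsDigit(domain[lastDot + 1]))
        return false;

    return true;
}

Console.WriteLine(PassesCodeChecks("pootietang.@example.com")); // False
Console.WriteLine(PassesCodeChecks("wo..oly@example.com"));     // False
Console.WriteLine(PassesCodeChecks("foo@bar.1com"));            // False
Console.WriteLine(PassesCodeChecks("foobar@_.com"));            // False
Console.WriteLine(PassesCodeChecks("Ima.Fool@example.com"));    // True
```

A check like this would run on each regex match before the address is added to the result set, so the pattern itself can stay comparatively simple.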

Fix for issue https://github.com/HaveIBeenPwned/EmailAddressExtractor/issues/5

KonajuGames commented 1 year ago

Regexes are one of those things where there is no one correct answer. Everyone has a different take, and that is just how they seem to be designed. I will admit I am no regex magician, and I have generally tried to avoid them in my professional career.

I've got it down to one failing test (unescaped_double_quote_is_invalid), but it does need some code filtering along with the regex. The regex simply cannot do it all by itself.

        // Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
        public List<string> ExtractAddresses(string content)
        {
            if (string.IsNullOrWhiteSpace(content))
                return new();

            // The pattern lets some false positives through. These are filtered out below.
            string addressPattern = @"[\p{L}\d\-_\+\/\&\!\%""\\\.\*]+[^\.]@[\p{L}\d][\p{L}\d\.\-_]*\.[a-z]{2,}\b";
            var matches = Regex.Matches(content, addressPattern, RegexOptions.IgnoreCase);
            var uniqueAddresses = new HashSet<string>();

            foreach (Match match in matches)
            {
                var address = match.Value;

                // Filter out the false positives that the regex could not handle
                if (address[0] == '.')          // leading dot
                    continue;
                if (address.Contains('*'))      // asterisk anywhere in the match
                    continue;
                if (address.Contains(".."))     // consecutive dots
                    continue;
                if (address.Length >= 256)      // 256 characters or longer
                    continue;

                // Lower-case so that addresses differing only by case collapse into one entry
                uniqueAddresses.Add(address.ToLowerInvariant());
            }

            return uniqueAddresses.OrderBy(a => a).ToList();
        }

hiteshbedre commented 1 year ago

Regexes are one of those things where there is no one correct answer.

Completely agree with your statement.

hiteshbedre commented 1 year ago

accented characters

That's new to me. Thanks for sharing; I've updated the pull request accordingly.

troyhunt commented 1 year ago

Good progress @hiteshbedre, thank you! Still got a few failing legacy tests, want to take a stab at those?

hiteshbedre commented 1 year ago

Yeah sure.