Closed hiteshbedre closed 1 year ago
Regexes are one of these where there is no one correct answer. Everyone has a different take, and that is just how they seem to be designed. I will admit I am no regex magician, and I have generally tried to avoid them in my professional career.
I've got it down to one failing test (unescaped_double_quote_is_invalid), but it does need some code filtering along with the regex. The regex simply cannot do it all by itself.
public List<string> ExtractAddresses(string content)
{
    if (string.IsNullOrWhiteSpace(content))
        return new();

    // The pattern lets some false positives through. These are filtered out below.
    string addressPattern = @"[\p{L}\d\-_\+\/\&\!\%""\\\.\*]+[^\.]@[\p{L}\d][\p{L}\d\.\-_]*\.[a-z]{2,}\b";
    var matches = Regex.Matches(content, addressPattern, RegexOptions.IgnoreCase);
    var uniqueAddresses = new HashSet<string>();

    foreach (Match match in matches)
    {
        var address = match.Value;

        // Filter out the false positives that the regex could not handle.
        if (address[0] == '.')      // leading dot
            continue;
        if (address.Contains('*'))  // asterisk anywhere in the match
            continue;
        if (address.Contains(".."))  // consecutive dots
            continue;
        if (address.Length >= 256)  // overly long address
            continue;

        uniqueAddresses.Add(address.ToLower());
    }

    return uniqueAddresses.OrderBy(a => a).ToList();
}
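For the one test that still fails (unescaped_double_quote_is_invalid), an extra code filter along the following lines might help. This is only a sketch under assumptions: the helper name `HasOnlyEscapedInnerQuotes` is hypothetical, and the rule it applies (a local part may contain double quotes only if it is wrapped in them, and any inner quote must be preceded by a backslash) is my reading of the test name, not something confirmed in this thread.

```csharp
public static class QuoteCheck
{
    // Hypothetical helper, not part of the pull request.
    // Returns true when the local part has no double quotes at all, or when it is
    // wrapped in double quotes and every inner double quote is backslash-escaped.
    public static bool HasOnlyEscapedInnerQuotes(string localPart)
    {
        if (!localPart.Contains("\""))
            return true;

        // Assumption: a local part containing quotes must start and end with one.
        if (localPart.Length < 2 || localPart[0] != '"' || localPart[localPart.Length - 1] != '"')
            return false;

        for (int i = 1; i < localPart.Length - 1; i++)
        {
            if (localPart[i] == '\\')
            {
                // Skip the escaped character. If that would consume the closing
                // quote, the local part is not properly terminated.
                i++;
                if (i >= localPart.Length - 1)
                    return false;
                continue;
            }

            if (localPart[i] == '"')
                return false; // unescaped inner quote
        }

        return true;
    }
}
```

In the loop above it could be called on the part of `address` before the last `@`, skipping the match when it returns false.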
Regexes are one of these where there is no one correct answer.
Completely agree with your statement.
accented characters
That is new to me. Thanks for sharing; I have updated the pull request accordingly.
Good progress @hiteshbedre, thank you! Still got a few failing legacy tests, want to take a stab at those?
Yeah sure.
I have handled most of the cases in the "LegacyTests.cs" file.
Valid cases handled:
Mary&Jane@example.org
""test\""blah""@example.com
customer/department@example.com
$A12345@example.com
!def!xyz%abc@example.com
_Yosemite.Sam@example.com
Ima.Fool@example.com
foobar@x.com
foobar@c0m.com
foobar@c_m.com
foo@bar_.com
foo@666.com
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111@example.com
Invalid cases handled:
(char)8 + "ar.com"
(char)9 + "ar.com"
(char)127 + "ar.com"
.wooly@example.com
pootietang.@example.com
.@example.com
foo@bar
foo@bar.a
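The control-character cases in the list above (characters 8, 9 and 127) could also be rejected in code rather than in the pattern. The following is a minimal sketch, not the pull request's implementation; the class and method names are hypothetical.

```csharp
using System.Linq;

public static class ControlCharacterCheck
{
    // Hypothetical helper, not part of the pull request.
    // Returns true if the candidate contains any Unicode control character,
    // which includes backspace (8), tab (9) and DEL (127).
    public static bool ContainsControlCharacter(string candidate)
    {
        return candidate.Any(c => char.IsControl(c));
    }
}
```

A candidate for which this returns true would simply be skipped before it is added to the result set.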
Patterns not handled by the regex:
pootietang.@example.com
wo..oly@example.com
foo@bar.1com
foobar@_.com
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111@example.com
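The patterns listed above could be caught by a few code checks applied after the regex match. The sketch below assumes every one of them is meant to be rejected; the helper name `PassesPostChecks` is hypothetical, and the 256-character limit is borrowed from the snippet earlier in this thread rather than taken from the tests.

```csharp
using System;

public static class AddressPostFilter
{
    // Hypothetical helper, not part of the pull request.
    // Returns false for candidates that a permissive regex might let through.
    public static bool PassesPostChecks(string address)
    {
        int at = address.LastIndexOf('@');
        if (at <= 0 || at == address.Length - 1)
            return false; // no local part or no domain

        string local = address.Substring(0, at);
        string domain = address.Substring(at + 1);

        // Local part must not start or end with a dot (pootietang.@example.com).
        if (local.StartsWith(".", StringComparison.Ordinal) || local.EndsWith(".", StringComparison.Ordinal))
            return false;

        // No consecutive dots anywhere (wo..oly@example.com).
        if (address.Contains(".."))
            return false;

        // Domain must not start with an underscore (foobar@_.com).
        if (domain.StartsWith("_", StringComparison.Ordinal))
            return false;

        // The last domain label must exist and must not start with a digit (foo@bar.1com).
        int lastDot = domain.LastIndexOf('.');
        if (lastDot < 0 || lastDot == domain.Length - 1)
            return false;
        if (char.IsDigit(domain[lastDot + 1]))
            return false;

        // Assumed length limit, mirroring the earlier snippet.
        if (address.Length >= 256)
            return false;

        return true;
    }
}
```

Underscores elsewhere in the domain still pass, so addresses such as foobar@c_m.com and foo@bar_.com from the valid list are not affected.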
Fix for issue https://github.com/HaveIBeenPwned/EmailAddressExtractor/issues/5