HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Performance boost due to splitting regex #53

Closed: StevenWilmot closed this 1 year ago

StevenWilmot commented 1 year ago

By splitting the regex in two (a first pass using the NonBacktracking option, then a second pass performing the extra checks), throughput can be increased from approx. 7,000 emails/sec to approx. 50,000 emails/sec.

The original logic remains intact; the new version simply skips the more intensive checks when the quick check fails.
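
To make the two-pass idea concrete, here is a minimal sketch. It is not the project's code: the class name `TwoStageMatcher` and both patterns are simplified, hypothetical stand-ins, and it assumes .NET 7 or later, where `RegexOptions.NonBacktracking` is available. The loose pattern avoids lookarounds because the non-backtracking engine does not support them; the strict pattern re-checks each candidate with a lookbehind.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStageMatcher
{
    // Stage 1 (hypothetical pattern): cheap candidate finder.
    // NonBacktracking matches in linear time but forbids lookarounds
    // and backreferences, so this pattern uses neither.
    private static readonly Regex Loose = new Regex(
        @"[a-z0-9.\-*!#$%&'+/=?^_`{|}~]+@(?:[a-z0-9\-_]+\.)+[a-z0-9]{2,}",
        RegexOptions.NonBacktracking | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Stage 2 (hypothetical pattern): stricter check on the candidate only.
    // The lookbehind rejects a local part that ends with a dot.
    private static readonly Regex Strict = new Regex(
        @"^[a-z0-9.\-*!#$%&'+/=?^_`{|}~]+(?<!\.)@(?:[a-z0-9\-_]+\.)+[a-z0-9]{2,}$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static List<string> Extract(string line)
    {
        var results = new List<string>();
        foreach (Match candidate in Loose.Matches(line))
        {
            // The expensive regex only ever sees short candidate strings,
            // never the whole input line.
            if (Strict.IsMatch(candidate.Value))
            {
                results.Add(candidate.Value);
            }
        }
        return results;
    }
}
```

Calling `TwoStageMatcher.Extract("contact: alice@example.com, bob.@example.org")` keeps `alice@example.com` and drops `bob.@example.org`, which the loose pass finds but the strict pass rejects.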

GStefanowich commented 1 year ago

Modifying the main regex to be looser looks good. I haven't tested this change myself, but ~50k/s plus @jfbourke's suggestion on Channels could really speed things up.

The only suggestion I would make is possibly turning EmailRegex_Full into a Filter, like the rest of the program design, so that future contributions don't turn into a bunch of LINQ or Where clauses again.

There might also be more performance to gain by putting EmailRegex_Full behind the LengthFilter, so that the length of the loose capture is checked first and the second regex isn't run if the capture exceeds the limit.

using System.Text.RegularExpressions;

[AddressFilter(Priority = 990)]
public sealed partial class StrictMatchFilter : AddressFilter.BaseFilter
{
    /// <summary>
    /// Email Regex pattern with full complex checks
    /// </summary>
    [GeneratedRegex(
        @"(\\"")?""?'?[a-z0-9\.\-\*!#$%&'+/=?^_`{|}~""\\]+(?<!\.)@([a-z0-9\-_]+\.)+[a-z0-9]{2,}\b(\\"")?""?'?(?<!\s)",
        RegexOptions.ExplicitCapture // Only named groups capture; unnamed groups behave like '(?:)'. We don't make use of the groups
        | RegexOptions.IgnoreCase // Match upper and lower casing
        | RegexOptions.Compiled // Ignored by the [GeneratedRegex] source generator
        | RegexOptions.Singleline // '.' also matches newlines (the pattern has no unescaped '.')
        | RegexOptions.CultureInvariant // Use invariant-culture casing rules for IgnoreCase
    )]
    public static partial Regex StrictRegex();

    public override string Name => "Strict Match";

    /// <inheritdoc />
    public override Result ValidateEmailAddress(ref EmailAddress address)
        => this.Continue(StrictMatchFilter.StrictRegex()
            .IsMatch(address.Full));
}
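
As a rough illustration of why the ordering matters, the sketch below runs checks in priority order and stops at the first failure, so a cheap length check can spare the regex entirely. Everything here is hypothetical: `IAddressCheck`, `LengthCheck`, `StrictCheck` and `FilterPipeline` are stand-ins rather than the project's `AddressFilter` types, the 254-character limit is an assumed value, and "higher priority runs first" is an assumption about the ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the project's filter abstraction.
public interface IAddressCheck
{
    int Priority { get; }           // assumption: higher values run first
    bool Accepts(string address);
}

public sealed class LengthCheck : IAddressCheck
{
    public int Priority => 1000;
    // 254 is an assumed limit, used only for illustration.
    public bool Accepts(string address) => address.Length <= 254;
}

public sealed class StrictCheck : IAddressCheck
{
    // Simplified pattern, not the project's strict regex.
    private static readonly Regex Pattern = new Regex(
        @"^[a-z0-9.\-_]+@(?:[a-z0-9\-_]+\.)+[a-z0-9]{2,}$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public int Calls { get; private set; }  // counts how often the regex ran
    public int Priority => 990;

    public bool Accepts(string address)
    {
        Calls++;
        return Pattern.IsMatch(address);
    }
}

public sealed class FilterPipeline
{
    private readonly List<IAddressCheck> _checks;

    public FilterPipeline(IEnumerable<IAddressCheck> checks)
    {
        // Sort once up front: highest priority first.
        _checks = new List<IAddressCheck>(checks);
        _checks.Sort((a, b) => b.Priority.CompareTo(a.Priority));
    }

    public bool IsValid(string address)
    {
        foreach (var check in _checks)
        {
            // Stop at the first failing check; later (costlier) checks never run.
            if (!check.Accepts(address)) return false;
        }
        return true;
    }
}
```

With a pipeline built from both checks, an over-long capture is rejected by `LengthCheck` and `StrictCheck.Calls` stays at zero; only captures within the limit reach the regex.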
GStefanowich commented 1 year ago

Old:

 - Read file x1 | Took 15.8m (at ~15.8m per)
   - Read line x10,000,001 | Took 15.8m (at ~95μs per)
   - Run regex x10,000,001 | Took 15.2m (at ~91μs per)
     - Generate capture x10,000,000 | Took 52s (at ~5μs per)
     - Check length     x10,000,000 | Took 3s (at ~0μs per)
     - Domain filter    x10,000,000 | Took 4s (at ~0μs per)
     - Filter invalids  x9,628,689 | Took 5s (at ~1μs per)
     - Filter quotes    x9,451,035 | Took 4s (at ~0μs per)
     - TLD Filter       x9,451,035 | Took 3s (at ~0μs per)

New:

 - Read file x1 | Took 1.7m (at ~1.7m per)
   - Read line x10,000,001 | Took 1.7m (at ~10μs per)
   - Run regex x10,000,001 | Took 1.0m (at ~6μs per)
     - Generate capture x10,000,000 | Took 8s (at ~1μs per)
     - Check length     x10,000,000 | Took 3s (at ~0μs per)
     - Strict Match     x10,000,000 | Took 9s (at ~1μs per)
     - Domain filter    x10,000,000 | Took 4s (at ~0μs per)
     - Filter invalids  x9,628,689 | Took 5s (at ~1μs per)
     - Filter quotes    x9,451,035 | Took 5s (at ~1μs per)
     - TLD Filter       x9,451,035 | Took 2s (at ~0μs per)
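
For comparing the two runs, the arithmetic can be sketched as below. The helper names are hypothetical; the only inputs are the figures reported above (10,000,001 lines, 15.8 minutes before, 1.7 minutes after), and since the reported times are rounded, the results are approximate.

```csharp
using System;

public static class TimingSummary
{
    // Lines processed per second for a run that took the given number of minutes.
    public static double LinesPerSecond(long lines, double minutes)
        => lines / (minutes * 60.0);

    // How many times faster the new run is than the old one.
    public static double Speedup(double oldMinutes, double newMinutes)
        => oldMinutes / newMinutes;
}
```

Using the totals from the two profiles, `TimingSummary.Speedup(15.8, 1.7)` comes out at a little over 9x for the whole file, and `TimingSummary.LinesPerSecond(10_000_001, 1.7)` gives the new per-second line rate.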