StevenWilmot closed 1 year ago
Modifying the main regex to be looser looks good. I haven't tested this change myself, but ~50k/s plus @jfbourke's suggestion on Channels could really speed things up.
Only suggestion I would make is possibly turning `EmailRegex_Full` into a Filter like the rest of the program design, just so any future contributions don't turn into a bunch of `Linq` or `Where` clauses again.
Plus there might be more performance gain from putting `EmailRegex_Full` behind the `LengthFilter`, so the length of the loose capture is checked first and the second regex doesn't run at all if the capture is too long.
```csharp
using System.Text.RegularExpressions;

[AddressFilter(Priority = 990)]
public sealed partial class StrictMatchFilter : AddressFilter.BaseFilter
{
    /// <summary>
    /// Email Regex pattern with full complex checks
    /// </summary>
    [GeneratedRegex(
        @"(\\"")?""?'?[a-z0-9\.\-\*!#$%&'+/=?^_`{|}~""\\]+(?<!\.)@([a-z0-9\-_]+\.)+[a-z0-9]{2,}\b(\\"")?""?'?(?<!\s)",
        RegexOptions.ExplicitCapture // Only named groups capture; unnamed groups act like '(?:)'. We don't make use of the groups
        | RegexOptions.IgnoreCase // Match upper and lower casing
        | RegexOptions.Compiled // Ignored by the regex source generator; kept from the original options
        | RegexOptions.Singleline // '.' also matches '\n' (no effect here: every '.' in the pattern is escaped)
        | RegexOptions.CultureInvariant // Ignore the current culture when comparing case
    )]
    public static partial Regex StrictRegex();

    public override string Name => "Strict Match";

    /// <inheritdoc />
    public override Result ValidateEmailAddress(ref EmailAddress address)
        => this.Continue(StrictMatchFilter.StrictRegex()
            .IsMatch(address.Full));
}
```
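To illustrate the ordering point (length check in front of the strict regex), here is a rough standalone sketch. It does not use the project's filter types: `FilterChainSketch`, the stage list, the 254-character limit and the simplified pattern are all assumptions made up for the example. It only shows that a cheap check placed first stops the expensive regex from running on oversized captures.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FilterChainSketch
{
    // Counts how often the expensive regex stage actually ran.
    public static int StrictRegexCalls;

    // Simplified stand-in for the strict pattern (assumption, not the project's regex).
    private static readonly Regex Strict = new(
        @"^[a-z0-9.\-_+]+@([a-z0-9\-_]+\.)+[a-z0-9]{2,}$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);

    // Hypothetical stages in execution order, cheapest first.
    // The 254-character limit is an assumed value for illustration.
    private static readonly List<(string Name, Func<string, bool> Check)> Stages = new()
    {
        ("Length", s => s.Length <= 254),
        ("Strict Match", s => { StrictRegexCalls++; return Strict.IsMatch(s); }),
    };

    // Runs the stages in order and stops at the first one that rejects the candidate.
    public static bool Passes(string candidate)
    {
        foreach (var stage in Stages)
        {
            if (!stage.Check(candidate))
            {
                return false;
            }
        }
        return true;
    }
}
```

With this ordering, an overlong capture is rejected by the length stage and `StrictRegexCalls` stays unchanged for it, which is the saving described above.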
Old:
- Read file x1 | Took 15.8m (at ~15.8m per)
- Read line x10,000,001 | Took 15.8m (at ~95μs per)
- Run regex x10,000,001 | Took 15.2m (at ~91μs per)
- Generate capture x10,000,000 | Took 52s (at ~5μs per)
- Check length x10,000,000 | Took 3s (at ~0μs per)
- Domain filter x10,000,000 | Took 4s (at ~0μs per)
- Filter invalids x9,628,689 | Took 5s (at ~1μs per)
- Filter quotes x9,451,035 | Took 4s (at ~0μs per)
- TLD Filter x9,451,035 | Took 3s (at ~0μs per)
New:
- Read file x1 | Took 1.7m (at ~1.7m per)
- Read line x10,000,001 | Took 1.7m (at ~10μs per)
- Run regex x10,000,001 | Took 1.0m (at ~6μs per)
- Generate capture x10,000,000 | Took 8s (at ~1μs per)
- Check length x10,000,000 | Took 3s (at ~0μs per)
- Strict Match x10,000,000 | Took 9s (at ~1μs per)
- Domain filter x10,000,000 | Took 4s (at ~0μs per)
- Filter invalids x9,628,689 | Took 5s (at ~1μs per)
- Filter quotes x9,451,035 | Took 5s (at ~1μs per)
- TLD Filter x9,451,035 | Took 2s (at ~0μs per)
By splitting the regex in two (the first using the `RegexOptions.NonBacktracking` option, the second performing the extra checks), throughput can be increased from approx. 7,000 emails/sec to approx. 50,000 emails/sec.
The original logic remains intact; the new version simply skips the more intensive checks when the quick check fails.
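For anyone reading along, a minimal sketch of the two-stage idea is below. Both patterns are simplified assumptions written for this example, not the ones in the PR, and `TwoStageEmailMatcher` is a hypothetical name. The first regex is loose and avoids lookarounds so it can run with `RegexOptions.NonBacktracking` (.NET 7 or later); the second, stricter regex only runs on what the first one captured.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStageEmailMatcher
{
    // Stage 1 (assumed pattern): deliberately loose, no lookarounds or backreferences,
    // so it is compatible with the NonBacktracking engine.
    private static readonly Regex Loose = new(
        @"[^\s@]+@[^\s@]+\.[a-z0-9]{2,}",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.NonBacktracking);

    // Stage 2 (assumed pattern): stricter check using a lookbehind, which
    // NonBacktracking does not support, so it uses the default engine.
    private static readonly Regex Strict = new(
        @"^[a-z0-9.\-_+]+(?<!\.)@([a-z0-9\-_]+\.)+[a-z0-9]{2,}$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);

    // Yields only the loose captures that also pass the strict check.
    public static IEnumerable<string> Extract(string line)
    {
        foreach (Match m in Loose.Matches(line))
        {
            // The expensive check runs only on candidates the quick pass found.
            if (Strict.IsMatch(m.Value))
            {
                yield return m.Value;
            }
        }
    }
}
```

Lines with no loose match never reach the strict regex at all, which is where most of the time is saved on large inputs.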