HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

[.net7] Improvements to performance #8

Closed: GStefanowich closed this 1 year ago

GStefanowich commented 1 year ago

This changes the target framework from .NET 6 to .NET 7, so I'm not sure if you want to do that.

.NET 7 introduced the GeneratedRegex attribute, which is supposed to help Regex performance. It generates the matching code at compile time instead of having the pattern parsed and constructed at runtime when Regex.Matches is called in the loop (honestly not sure if the Regex library does any static caching).
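For anyone who hasn't used the attribute yet, here is a minimal sketch of what it looks like. The class name `EmailPatterns` and the simplified pattern are assumptions for illustration only, not the pattern used in this PR. It needs the .NET 7 SDK (C# 11), because the source generator fills in the body of the partial method.

```csharp
using System;
using System.Text.RegularExpressions;

public static partial class EmailPatterns
{
    // Hypothetical, simplified pattern for illustration; not the project's real one.
    // The source generator emits the implementation of Email() at compile time.
    [GeneratedRegex(@"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")]
    public static partial Regex Email();
}

public static class GeneratedRegexDemo
{
    public static void Main()
    {
        const string input = "Contact: Alice@Example.com, bob@example.org";

        // Email() returns a cached Regex instance backed by the generated code.
        foreach (Match match in EmailPatterns.Email().Matches(input))
        {
            Console.WriteLine(match.Value);
        }
    }
}
```

The declaring type has to be `partial`, and the method returns a normal `Regex`, so call sites look the same as before.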

I also added RegexOptions.IgnoreCase so the character classes no longer need to list both cases: [a-zA-Z0-9] becomes [a-z0-9]. IgnoreCase got some performance improvements in .NET 7, as matching now effectively checks c is 'A' or 'a' instead of paying the overhead of char.ToLower(c) == 'a'. I then replaced the a-z ranges with \p{L}, which matches any Unicode letter rather than only ASCII letters, so if an email or a domain has umlauts or accented characters it'll match them.
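To show the difference between the ASCII ranges and \p{L}, here is a rough sketch. Both patterns, the `UnicodeDemo` class and the sample input are made up for illustration and are simpler than what the project uses.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeDemo
{
    // Hypothetical ASCII-only pattern; IgnoreCase lets [a-z] cover A-Z as well.
    public static readonly Regex Ascii = new Regex(
        @"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}",
        RegexOptions.IgnoreCase);

    // Same shape, but \p{L} accepts any Unicode letter (ü, é, ...), not just a-z.
    public static readonly Regex Unicode = new Regex(
        @"[\p{L}0-9._%+-]+@[\p{L}0-9.-]+\.\p{L}{2,}",
        RegexOptions.IgnoreCase);

    public static void Main()
    {
        const string input = "Mail JÜRGEN@müller.example or Bob@Example.com";

        Console.WriteLine($"ASCII ranges: {Ascii.Matches(input).Count} match(es)");
        Console.WriteLine($"\\p{{L}}:       {Unicode.Matches(input).Count} match(es)");
    }
}
```

On this input the ASCII pattern should only find Bob@Example.com, while the \p{L} pattern should also pick up the address with umlauts.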

Lastly, I changed the file read to be asynchronous and to go line by line instead of reading the whole file at once. Since your regular expression wasn't doing any multiline matching, there's no point in loading the entire file before beginning to search. This saves RAM when reading large files, because only the current line has to be held in memory.
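A minimal sketch of the line-by-line approach, assuming a simplified pattern and a hypothetical `LineScanner` class (neither is taken from the actual diff). It uses `StreamReader.ReadLineAsync`, so only one line is buffered at a time.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class LineScanner
{
    // Hypothetical, simplified pattern for illustration only.
    private static readonly Regex Email = new Regex(
        @"[\p{L}0-9._%+-]+@[\p{L}0-9.-]+\.\p{L}{2,}",
        RegexOptions.IgnoreCase);

    // Reads the file one line at a time and collects every match.
    public static async Task<List<string>> ExtractAsync(string path)
    {
        var found = new List<string>();
        using var reader = new StreamReader(path);

        string? line;
        // ReadLineAsync returns null once the end of the file is reached.
        while ((line = await reader.ReadLineAsync()) is not null)
        {
            foreach (Match match in Email.Matches(line))
            {
                found.Add(match.Value);
            }
        }

        return found;
    }

    public static async Task Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: LineScanner <file>");
            return;
        }

        foreach (string address in await ExtractAsync(args[0]))
        {
            Console.WriteLine(address);
        }
    }
}
```

This only works because the pattern never needs to match across a line break. A pattern that spans lines would need a different buffering strategy.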

troyhunt commented 1 year ago

I'm perfectly happy to upgrade, thanks for the suggestion. Want to rebase from latest and merge with the other changes then I'll take the PR?