Closed troyhunt closed 1 year ago
Doing some testing on this, I created a 3.2G file of fake emails. Using File.ReadLinesAsync
will consistently read about ~360,000 lines/s. Switching to a StreamReader
will consistently read about ~400,000 lines/s, so it's a bit of an improvement.
Continuing to make some tweaks and investing further on what could possibly speed up the Regex Matches;
Taking out the Regex Match and just reading the file directly will do ~1,900,000 lines/s with the StreamReader
, where it'll do ~1,560,000 lines/s with the File.ReadLinesAsync
. So it definitely helps to use the StreamReader
File.ReadLinesAsync | StreamReader | |
---|---|---|
Raw lines | ~1,560,000/s | ~1,900,000/s |
Regexed lines | ~360,000/s | ~400,000/s |
The Regex really kills the lines/second
Hmmm, interesting, there's obviously a lot more to it than that. Here's the original method I used:
public List<string> GetEmailsFromString(string source, int maxEmailLength = 256)
{
var matches = _regex.Matches(source.Replace(@"\n", Environment.NewLine));
return (from Match m in matches select m.Value.ToLower())
.Where(e => e.Length <= maxEmailLength)
.Distinct()
.ToList();
}
Damn, you've nailed it! Here's the output of the last run before this change:
Extraction time: 517,508ms
Addresses extracted: 2,075,606
Read lines total: 2,075,967
Read lines rate: 4,011/s
And after:
Extraction time: 27,273ms
Addresses extracted: 2,075,594
Read lines total: 2,075,955
Read lines rate: 76,106/s
This is against the Terravision data set, I'll do some more tests then hopefully close this out.
Thanks for the suggestion @GStefanowich. Agree with the performance hit issue you commented in #20
The other way I can think of to use regex \.\.|\*|\.@|^\.|@-
to eliminate the matches. For rest of the cases we can manage it via specific code itself.
Regex | Description |
---|---|
\.\. |
Email having consecutive dot |
\* |
Email having * |
.@ |
Email having .@ |
@- |
Email having @- |
Demo: https://regex101.com/r/MK80iM/1
This also can lead to performance improvement.
Apologies for not turning this into a full pull-request (I rarely use Git, and then found that my latest-commit contained too many changes (formatting-edits, local "experiments")
The primary EmailRegex is a big speed-culprit (using backtrack) - I have modified this: 1 - By doing a "quick test" first, then the full "regex" 2 - By swapping the order of the "Filter out edge case addresses" (quick tests first, then the InvalidRegex afterwards)
If anyone wants to merge in my changes ( https://github.com/HaveIBeenPwned/EmailAddressExtractor/commit/2c18c12491a143a672fcf1ac66bff265e38434b9#diff-86dd578eb0b2b43365b4c838e1a97fa71229c9c1ab5753c5f1644bf874cb3b62 ) you should get a solid 10x performance boost
Just to give it a good back to back test, I just ran the current version of this app against the Terravision breach I've just loaded. This is a 379MB tab delimited text file with half a dozen fields and ~2M email addresses. Here are the results:
That's a massive perf hit. Just eyeballing the differences, I think it may come down to me using a
StreamReader
in the original implementation and the newer one usingFile.ReadLinesAsync
.To give everyone a good reference file without disclosing a whole heap of PII, I've used Red Gate's SQL Data Generator to create a test file with 10M rows: https://mega.nz/file/Ls8U1ADK#c1We1C_CZi44P0k3OB8YpNVN7HMM3gE_4-fH06E454c
I haven't benchmarked the old tool against this yet (I'll update this idea when I do), but there's definitely a massive improvement to be had.