HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License
64 stars 23 forks source link

Performance enhancements #15

Closed troyhunt closed 1 year ago

troyhunt commented 1 year ago

Just to give it a good back to back test, I just ran the current version of this app against the Terravision breach I've just loaded. This is a 379MB tab delimited text file with half a dozen fields and ~2M email addresses. Here are the results:

  1. My crusty old tool: 1min 43sec
  2. The current version in this repo: 9min

That's a massive perf hit. Just eyeballing the differences, I think it may come down to me using a StreamReader in the original implementation and the newer one using File.ReadLinesAsync.

To give everyone a good reference file without disclosing a whole heap of PII, I've used Red Gate's SQL Data Generator to create a test file with 10M rows: https://mega.nz/file/Ls8U1ADK#c1We1C_CZi44P0k3OB8YpNVN7HMM3gE_4-fH06E454c

I haven't benchmarked the old tool against this yet (I'll update this idea when I do), but there's definitely a massive improvement to be had.

GStefanowich commented 1 year ago

Doing some testing on this, I created a 3.2G file of fake emails. Using File.ReadLinesAsync will consistently read about ~360,000 lines/s. Switching to a StreamReader will consistently read about ~400,000 lines/s, so it's a bit of an improvement.

Continuing to make some tweaks and investing further on what could possibly speed up the Regex Matches;

Taking out the Regex Match and just reading the file directly will do ~1,900,000 lines/s with the StreamReader, where it'll do ~1,560,000 lines/s with the File.ReadLinesAsync. So it definitely helps to use the StreamReader

File.ReadLinesAsync StreamReader
Raw lines ~1,560,000/s ~1,900,000/s
Regexed lines ~360,000/s ~400,000/s

The Regex really kills the lines/second

troyhunt commented 1 year ago

Hmmm, interesting, there's obviously a lot more to it than that. Here's the original method I used:

    public List<string> GetEmailsFromString(string source, int maxEmailLength = 256)
    {
      var matches = _regex.Matches(source.Replace(@"\n", Environment.NewLine));
      return (from Match m in matches select m.Value.ToLower())
        .Where(e => e.Length <= maxEmailLength)
        .Distinct()
        .ToList();
    }
troyhunt commented 1 year ago

Damn, you've nailed it! Here's the output of the last run before this change:

Extraction time: 517,508ms
Addresses extracted: 2,075,606
Read lines total: 2,075,967
Read lines rate: 4,011/s

And after:

Extraction time: 27,273ms
Addresses extracted: 2,075,594
Read lines total: 2,075,955
Read lines rate: 76,106/s

This is against the Terravision data set, I'll do some more tests then hopefully close this out.

hiteshbedre commented 1 year ago

Thanks for the suggestion @GStefanowich. Agree with the performance hit issue you commented in #20 The other way I can think of to use regex \.\.|\*|\.@|^\.|@- to eliminate the matches. For rest of the cases we can manage it via specific code itself.

Regex Description
\.\. Email having consecutive dot
\* Email having *
.@ Email having .@
@- Email having @-

Demo: https://regex101.com/r/MK80iM/1

This also can lead to performance improvement.

StevenWilmot commented 1 year ago

Apologies for not turning this into a full pull-request (I rarely use Git, and then found that my latest-commit contained too many changes (formatting-edits, local "experiments")

The primary EmailRegex is a big speed-culprit (using backtrack) - I have modified this: 1 - By doing a "quick test" first, then the full "regex" 2 - By swapping the order of the "Filter out edge case addresses" (quick tests first, then the InvalidRegex afterwards)

If anyone wants to merge in my changes ( https://github.com/HaveIBeenPwned/EmailAddressExtractor/commit/2c18c12491a143a672fcf1ac66bff265e38434b9#diff-86dd578eb0b2b43365b4c838e1a97fa71229c9c1ab5753c5f1644bf874cb3b62 ) you should get a solid 10x performance boost