HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Recent commits have nuked performance #41

Closed · troyhunt closed this issue 1 year ago

troyhunt commented 1 year ago

I suspect this is related to regex changes, but the performance has massively dropped in the last day. I ran the current version over the CityJerks breach this morning, expecting it to take about 10 mins as it had done yesterday. It got to over an hour, so I grabbed the stats at the 5 minute mark:

Extraction time: 300,001ms
Addresses extracted: 8,749
Read lines total: 10,193
Read lines rate: 33/s

I just branched off 67cc8afdc26a6adff4dc7fc79228c7250de1efec and ran it again:

Extraction time: 300,005ms
Addresses extracted: 47,284
Read lines total: 927,564
Read lines rate: 3,091/s

I'm going to create a separate issue to write some tests around performance because it's so easy to make a regex change and not realise the perf impact once it runs over a sizeable data set.

troyhunt commented 1 year ago

Quick bump on this for @GStefanowich and @jfbourke in particular, given how much recent work you've put into this project. This issue is a real show-stopper right now; I can't really run the app against any data sets to test further whilst perf is suffering this badly. Any thoughts on a fix?

GStefanowich commented 1 year ago

There was a good suggestion in #15: it might be best to revert some of the changes to the main Regex and apply more filters instead, since the Regex is the most costly step. GeneratedRegex has a NonBacktracking option which is supposed to help performance on large searches especially, but it can't be used because the Regex relies on backtracking for the (?<) captures somewhere.
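For reference, wiring up a source-generated regex with the non-backtracking engine looks roughly like this; the attribute and RegexOptions.NonBacktracking are real .NET 7+ APIs, but the pattern below is only a simplified placeholder, not the project's actual one:

```csharp
using System.Text.RegularExpressions;

public static partial class EmailRegex
{
    // NonBacktracking runs in linear time over the input, which is what should help on
    // huge dumps, but it rejects constructs that require backtracking (lookarounds,
    // backreferences), so the main pattern would need reworking first.
    [GeneratedRegex(
        @"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        RegexOptions.NonBacktracking | RegexOptions.ExplicitCapture)]
    public static partial Regex Instance();
}
```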

GStefanowich commented 1 year ago

@troyhunt Running the current latest master, this is the performance I'm seeing on the test data dump:

Extraction time: 1.0m
Addresses extracted: 590,944
Read lines total: 590,944
Read lines rate: 9,846/s

Extraction time: 2.0m
Addresses extracted: 1,185,119
Read lines total: 1,185,119
Read lines rate: 9,875/s

End result:

 - Read file x1 | Took 15.8m (at ~15.8m per)
   - Read line x10,000,001 | Took 15.8m (at ~95μs per)
   - Run regex x10,000,001 | Took 15.2m (at ~91μs per)
     - Generate capture x10,000,000 | Took 52s (at ~5μs per)
     - Check length     x10,000,000 | Took 3s (at ~0μs per)
     - Domain filter    x10,000,000 | Took 4s (at ~0μs per)
     - Filter invalids  x9,628,689 | Took 5s (at ~1μs per)
     - Filter quotes    x9,451,035 | Took 4s (at ~0μs per)
     - TLD Filter       x9,451,035 | Took 3s (at ~0μs per)

jfbourke commented 1 year ago

@troyhunt do you have any constraints on CPU/Memory usage? I've been looking into https://learn.microsoft.com/en-us/dotnet/core/extensions/channels and in a feature branch I am seeing timings between 49s and around 2.5 mins depending on the number of consumer threads and queue length set. CPU/Memory utilization is obviously higher for the faster run, which was an unbounded channel.

As an example, with a bounded channel set to a 100K message capacity and 6 consumers, I see around 100MB of memory and approx. 35% CPU with the below timings (my rig has 24 cores for reference):

Extraction time: 1.0m
Addresses extracted: 5,401,425
Read lines total: 5,401,425
Read lines rate: 90,020/s

Extraction time: 1.9m
Addresses extracted: 10,000,001
Read lines total: 10,000,001
Read lines rate: 88,466/s

At this stage, very much PoC code.
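Roughly this shape, for illustration only (not the actual branch code; the file path and ExtractAddresses are placeholders):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

// One reader fills a bounded queue of lines; six consumers run the extraction over them.
var path = args.Length > 0 ? args[0] : "data.txt";       // input file (placeholder)

var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(100_000)
{
    SingleWriter = true,                                  // a single file-reading producer
    FullMode = BoundedChannelFullMode.Wait                // back-pressure if consumers fall behind
});

// Producer: stream the file line by line into the channel.
var producer = Task.Run(async () =>
{
    await foreach (var line in File.ReadLinesAsync(path))
        await channel.Writer.WriteAsync(line);
    channel.Writer.Complete();
});

// Consumers: drain the channel concurrently, running the regex/filter pipeline per line.
var consumers = Enumerable.Range(0, 6).Select(_ => Task.Run(async () =>
{
    await foreach (var line in channel.Reader.ReadAllAsync())
        ExtractAddresses(line);
})).ToArray();

await Task.WhenAll(consumers.Append(producer));

// Placeholder for the real per-line extraction work.
static void ExtractAddresses(string line) { /* run regex + filters here */ }
```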

GStefanowich commented 1 year ago

@jfbourke Using channels is a good call

GStefanowich commented 1 year ago

You could also set up some additional command-line parameters for tweaking the limits of the channel.
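Something like this, for example; these flags don't exist in the tool today and the names are purely illustrative:

```csharp
using System;

// Illustrative only: hypothetical flags for tuning the channel.
int capacity = 100_000;
int consumers = Environment.ProcessorCount;

for (var i = 0; i < args.Length - 1; i++)
{
    switch (args[i])
    {
        case "--channel-capacity": capacity = int.Parse(args[i + 1]); break;
        case "--consumers":        consumers = int.Parse(args[i + 1]); break;
    }
}

Console.WriteLine($"Channel capacity: {capacity}, consumers: {consumers}");
```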

GStefanowich commented 1 year ago

Even without channels, with #53 I'm now seeing this:

Extraction time: 1.0m
Addresses extracted: 5,501,653
Read lines total: 5,501,654
Read lines rate: 91,682/s

jfbourke commented 1 year ago

Yep, with #53 I am getting the below timings, so I'll hold back the channel changes for now. I would suggest looking into streaming the matches into the addresses_output.txt file to reduce memory pressure (roughly the shape sketched after the timings below). @troyhunt have we lost accuracy at all (are we expecting 10 million addresses)?

Extraction time: 31s
Addresses extracted: 9,451,035
Read lines total: 9,451,035
Read lines rate: 309,413/s
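Something along these lines, as a sketch only (ExtractAddressesAsync is a hypothetical stand-in for the extraction loop, and this alone doesn't deal with de-duplication or sorting of the final output):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Append each address to the output file as it is found, rather than holding
// the full set in RAM until the end.
await using var output = new StreamWriter("addresses_output.txt", append: true);

await foreach (var address in ExtractAddressesAsync())
    await output.WriteLineAsync(address);

// Placeholder for the real extraction loop (regex + filters yielding matches).
static async IAsyncEnumerable<string> ExtractAddressesAsync()
{
    await Task.CompletedTask;
    yield break;
}
```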
GStefanowich commented 1 year ago

@jfbourke I've implemented channels in #55; you must have either a faster disk or a faster CPU to be doing 309,413/s from just #53.

Can you clone #55 and see what you get?

jfbourke commented 1 year ago

@GStefanowich I get the below (saving to disk takes about half the time again); CPU was around 15% and memory maxed out at around 2GB.

Extracting...
Finished reading files
Extraction time: 25s
Addresses extracted: 9,451,035
Read lines total: 9,451,035
Read lines rate: 371,036/s

Saving to disk..

GStefanowich commented 1 year ago

@jfbourke Thanks

Reducing RAM usage might be good; it's just difficult since the extracted emails are sorted, so we would have to use temporary files or something, trading RAM for IO.

That's out of my wheelhouse, so feel free if you or someone else wants to take a stab at it.

troyhunt commented 1 year ago

@troyhunt have we lost accuracy at all (are we expecting 10 million addresses)?

Yep, something has been lost as there are exactly 10M email addresses in the sample set.

GStefanowich commented 1 year ago

@troyhunt Does the sample set contain invalid domains?

I ran the test and logged the emails that are being filtered out:

lrcz.ukwm@-ikofv.com
pkbamik.pudnznllq@-fzlmz.org
mewrldep.fkgktzu@-bunqy.org
jelf7@-itfkx.org
bnyepzl@-coloc.net
itogruz.lmxurhy@---qle.org
bqgs.wilchl@-giphr.com
lnxcidg5@-j-gfu.org
jjxatts.edoweyip@-vlhzl.org
thlac.tgogbesg@-dspgs.net
rtivgjdx.ifyfuq@-pubwq.net
ozocxqe.olgauk@-wuzqm.com
dpug@-jqstw.net
kiecj717@-pvppz.com
jvqx.dvrsrte@-sbnks.net
nsfjwdsp7@-w-huy.com
hepm.sqnhfkd@-nhawi.org
ptob@-mxtgt.org
stvd.unrchtfhbu@-obqkg.net
belau3@-qcirl.net

It's all domains starting with a hyphen.
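For reference, the rule being tripped is that a DNS label can't begin (or end) with a hyphen. A minimal check along those lines, as a hypothetical helper rather than the project's actual filter:

```csharp
using System.Linq;

// Hypothetical helper: reject any domain whose labels start or end with '-'.
static bool HasValidLabels(string domain) =>
    domain.Split('.').All(label =>
        label.Length > 0 && label[0] != '-' && label[^1] != '-');

// HasValidLabels("-ikofv.com")  => false
// HasValidLabels("example.com") => true
```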

troyhunt commented 1 year ago

Ooh, that's a really good point. I generated all the sample data with Red Gate's SQL Data Generator, so it's entirely possible their pattern for generating addresses creates entries that fail our criteria. Let me look closer; I might just strip out the leading dashes on domains so we have one clean set of 10M, which will make it much easier to see at a glance if they're all extracted.

troyhunt commented 1 year ago

Ok, I've just cleaned the test data up and published a V2 which results in exactly 10M email addresses being extracted. Get it from here: https://mega.nz/file/Xk91ETzb#UYklfa84pLs5OzrysEGNFVMbFb5OC0KU7rlnugF_Aps

troyhunt commented 1 year ago

I think perf is great as of this moment; I'm just going to leave this issue open until I get home in a couple of days and can run it against the CityJerks data again, so we have a true back-to-back test against the same data set.

troyhunt commented 1 year ago

Home again and ran this over the same breach as in the original post:

[14/05/2023 16:17:13] Extraction time: 3.1m
[14/05/2023 16:17:13] Addresses extracted: 8,909
[14/05/2023 16:17:13] Read lines total: 11,270
[14/05/2023 16:17:13] Read lines rate: 60/s

Total run time was 23.8 minutes, so we're still really suffering performance-wise here given it was about 10 minutes earlier on. If someone would like to take a look at this, I'm going to leave the issue open until we can get back to processing times comparable to before.

jfbourke commented 1 year ago

@troyhunt how similar is the CityJerks file to the RedGate generated one?

For the latest RedGate file I am getting the below; what timings are you seeing, and what machine specs are you running on?

Finished reading files
Extraction time: 28s
Addresses extracted: 10,000,000
Read lines total: 10,000,000
Read lines rate: 363,042/s

jfbourke commented 1 year ago

Ok, just ran the RedGate file on a Raspberry Pi and it took 45 mins. There is a considerable drop-off in performance during processing.

[attached chart: report]

Commenting out the TryAdd to the Addresses dictionary results in the below timings, but this is not completely valid as the addresses would need to be streamed to the output file and duplicates might be present:

Finished reading files
Extraction time: 4.6m
Addresses extracted: 0 <--- this is zero due to not holding in memory
Read lines total: 10,000,000
Read lines rate: 36,302/s

Changing to a HashSet results in a 13 min execution time, but as a lock is needed we still get a perf drop-off (roughly the shape sketched at the end of this comment).

Extraction time: 1.0m
Addresses extracted: 2,085,276
Read lines total: 2,085,279
Read lines rate: 34,141/s
. . .
Extraction time: 13.0m
Addresses extracted: 10,000,000
Read lines total: 10,000,000
Read lines rate: 12,820/s

Is post-processing for duplicates a possibility?
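For reference, the lock-wrapped HashSet variant tried above is roughly this shape (names are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Every extraction thread takes the same lock before adding, so the set stays
// consistent but all workers serialize on this one section.
var seen = new HashSet<string>();
var gate = new object();

bool TryAddAddress(string address)
{
    lock (gate)
    {
        return seen.Add(address);
    }
}

Console.WriteLine(TryAddAddress("someone@example.com"));   // True
Console.WriteLine(TryAddAddress("someone@example.com"));   // False (duplicate)
```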

GStefanowich commented 1 year ago

@jfbourke Would you mind cloning 1437145170a47c76f5be8561142f633106904524 on your rpi to see how it does? I don't have one handy to do so

git clone -b performance --single-branch git@github.com:GStefanowich/EmailAddressExtractor.git

It probably needs some tweaking of its own, since it's a solution that possibly takes up some more RAM with everything running at the same time, instead of doing the de-duplicating at the end.

Swapping out the concurrent dictionary for the HashSet directly might end up with concurrency issues. This adds another Channel and a Task that does the de-duplication: the Channel handles the concurrency, and since the single Task doesn't have to be concurrent it can use the HashSet just fine.
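Roughly this shape, for illustration (not the exact branch code):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Extraction workers push candidate addresses into a channel; a single consumer
// task owns the HashSet, so no lock is needed for de-duplication.
var dedup = Channel.CreateUnbounded<string>(new UnboundedChannelOptions
{
    SingleReader = true                    // exactly one task drains the channel
});

var unique = new HashSet<string>();

var dedupTask = Task.Run(async () =>
{
    await foreach (var address in dedup.Reader.ReadAllAsync())
        unique.Add(address);               // safe: only this task ever touches the set
});

// Any number of concurrent extraction workers hand off addresses without locking.
void OnAddressExtracted(string address) => dedup.Writer.TryWrite(address);

// Example shutdown once all workers have finished:
OnAddressExtracted("someone@example.com");
dedup.Writer.Complete();
await dedupTask;
Console.WriteLine($"Unique addresses: {unique.Count}");
```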

Running the test data this way on my PC drops from about 6.2m to 3.4m:

Before:
 - Run regex x10,000,001 | Took 6.2m (at ~37μs per)
 - Read file x1 | Took 49s (at ~49s per)
   - Read line x10,000,001 | Took 48s (at ~5μs per)

After:
 - Run regex x10,000,001 | Took 3.4m (at ~20μs per)
 - Read file x1 | Took 27s (at ~27s per)
   - Read line x10,000,001 | Took 26s (at ~3μs per)

jfbourke commented 1 year ago

@GStefanowich It looks to run in about the same time as using a HashSet with a lock wrapper, and exhibits the same slowdown.

Extraction time: 1.0m
Addresses extracted: 2,020,311
Read lines total: 2,020,951
Read lines rate: 33,626/s
Extraction time: 2.0m
Addresses extracted: 3,114,129
Read lines total: 3,114,130
Read lines rate: 25,942/s
. . .
Extraction time: 6.0m
Addresses extracted: 6,282,357
Read lines total: 6,282,358
Read lines rate: 17,357/s
Extraction time: 7.0m
. . .
Extraction time: 13.0m
Addresses extracted: 9,781,787
Read lines total: 9,781,799
Read lines rate: 12,507/s
Extraction time: 13.4m
Addresses extracted: 10,000,000
Read lines total: 10,000,001
Read lines rate: 12,462/s
Extraction time: 14.0m
Addresses extracted: 10,000,000
Read lines total: 10,000,001
Read lines rate: 11,903/s

troyhunt commented 1 year ago

Sorry for the delay on this, guys; I fell behind a bit.

Ok, a couple of things. First, running this on my desktop in debug mode directly from VS resulted in the following against the generated data:

[23/05/2023 10:01:03] Extraction time: 4.2m
[23/05/2023 10:01:03] Addresses extracted: 10,000,000
[23/05/2023 10:01:03] Read lines total: 10,000,000
[23/05/2023 10:01:03] Read lines rate: 40,097/s

Running from the command line directly against a release build gave me this:

[23/05/2023 10:04:53] Finished reading files
[23/05/2023 10:04:53] Extraction time: 1.2m
[23/05/2023 10:04:53] Addresses extracted: 10,000,000
[23/05/2023 10:04:53] Read lines total: 10,000,000
[23/05/2023 10:04:53] Read lines rate: 138,653/s

Run back-to-back on my laptop, I got a total extraction time of 1.0m. Obviously there are lots of variables involved (I assume disk speed plays a big part), so I put the data on a different disk in the same machine and saw... 1.0m 🤷‍♂️

So it should probably go without saying, but it'd be nice to have some benchmarking guidelines (and I really should have picked up on this earlier too...)

Getting back to the CityJerks data from earlier: it's in a comparable format to the test data but with obviously different content. I'd expect the timing to be similar in terms of rate, but it's a 13GB file so obviously longer overall. I just ran it again now (14.7m on my desktop), but we really don't have an easily comparable output as it's changed between versions of this app.

In summary, I think we're all good now, so I'm going to close this issue off. A bit of guidance around the best way to run this for optimum performance would be good (maybe even listing variables such as disk speed).