HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Create tests to measure performance #42

Open troyhunt opened 1 year ago

troyhunt commented 1 year ago

This is related to #41, but I thought I'd break it out into its own issue as it's a discrete unit of work. Without tests, it's very hard to tell whether even a really minor tweak to a regex (or similar) is detrimental to performance. I'm not sure of the best way to do this, but we need at least a rough order-of-magnitude test suite that can pick up on this sort of thing. The change referenced in the aforementioned issue has hit perf by roughly 100x (33 emails per sec in the 300s report versus 3,091 before the change), so I imagine it won't be hard to detect something that varies perf by that much. Hopefully we can get a more finely tuned measurement model together and even start chipping away further at the best perf stats we've achieved.
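As a rough illustration of what an order-of-magnitude check could look like, here is a minimal Python sketch. It is not the project's code: the regex is a deliberately simple stand-in, and `extract`, `measure_rate`, `is_regression` and the 10x tolerance are all hypothetical names and values chosen for the example. The idea is to measure emails per second on a fixed input and flag a run only when it falls below the baseline by a large factor, so ordinary machine-to-machine noise doesn't trip it.

```python
import re
import time

# Hypothetical, deliberately simple pattern; NOT the project's actual regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract(lines):
    """Return every substring in `lines` that matches EMAIL_RE, in order."""
    found = []
    for line in lines:
        found.extend(EMAIL_RE.findall(line))
    return found


def measure_rate(lines, repeats=3):
    """Return (email_count, emails_per_second), using the fastest of `repeats` runs."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = len(extract(lines))
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return count, count / max(best, 1e-9)


def is_regression(baseline_rate, measured_rate, factor=10.0):
    """True when the measured rate is more than `factor` times slower than baseline."""
    return measured_rate * factor < baseline_rate


if __name__ == "__main__":
    sample = ["contact alice@example.com or bob@example.org today"] * 50_000
    count, rate = measure_rate(sample)
    print(f"{count} emails at {rate:,.0f} emails/sec")
    # The baseline would come from a stored earlier run; 3091 is the figure quoted above.
    print("regression:", is_regression(3091, rate))
```

With the numbers quoted above, `is_regression(3091, 33)` returns `True`, while a modest dip such as `is_regression(3091, 2500)` does not. Taking the fastest of several runs is one common way to damp timing noise; a finer-grained model would need a proper benchmarking harness.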

jaimevisser commented 1 year ago

It would be interesting to have some lifelike dataset to run tests against.
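For repeatable runs, one option is a seeded synthetic generator, so every test run sees the same input. The Python sketch below is only an assumption about what "lifelike" might mean here (emails embedded in CSV-ish lines, mixed with noise lines); `make_dataset`, the name lists and the 30% email ratio are hypothetical and not taken from the project.

```python
import random


def make_dataset(n_lines, email_ratio=0.3, seed=42):
    """Build `n_lines` of text; about `email_ratio` of them contain one email.

    Returns (lines, expected_emails) so a test can check correctness as well
    as speed. The same seed always produces the same dataset.
    """
    rng = random.Random(seed)
    users = ["alice", "bob.smith", "carol_j", "dave99", "eve+news"]
    domains = ["example.com", "example.org", "mail.example.net"]
    noise = ["id", "2023-01-01", "lorem ipsum", "NULL", "192.0.2.1", "ok"]

    lines, expected = [], []
    for i in range(n_lines):
        # Three noise fields per line; none of them contains "@".
        fields = [rng.choice(noise) for _ in range(3)]
        if rng.random() < email_ratio:
            email = f"{rng.choice(users)}{i}@{rng.choice(domains)}"
            # Drop the email into a random column of the line.
            fields.insert(rng.randrange(len(fields) + 1), email)
            expected.append(email)
        lines.append(",".join(fields))
    return lines, expected


if __name__ == "__main__":
    lines, expected = make_dataset(10_000)
    print(len(lines), "lines,", len(expected), "emails")
    print(lines[:3])
```

Returning the expected addresses alongside the lines means the same dataset can back both a correctness test and a throughput measurement.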

troyhunt commented 1 year ago

I actually generated a set of sample data back in #15 which I've now added to the readme. For brevity, it's here: https://mega.nz/file/Ls8U1ADK#c1We1C_CZi44P0k3OB8YpNVN7HMM3gE_4-fH06E454c