HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Resolves #66 #67

GStefanowich closed 9 months ago

GStefanowich commented 9 months ago

Resolves #66 - Simply modified the existing LengthFilter to also filter out bad Aliases.

As much as we're all Regex wizards, it's easier to simply filter out bad results after they've been caught than to try and wedge a {,64} into a new matching group.

This reuses the existing length filter, which already filtered out emails >256.
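For readers following along, here is a minimal sketch of what a combined length-and-alias check might look like. It is written in Python purely for illustration and is not the project's actual LengthFilter; the function name and constants are hypothetical, the 64 limit is assumed from the `{,64}` mentioned above, and treating the limits as character counts is also an assumption.

```python
# Hypothetical limits: 64 comes from the {,64} mentioned in the comment above,
# 256 from the existing filter's described behaviour (emails >256 are dropped).
MAX_ALIAS_LENGTH = 64
MAX_EMAIL_LENGTH = 256


def passes_length_filter(email: str) -> bool:
    """Return True when the whole address and its alias are within the limits."""
    if len(email) > MAX_EMAIL_LENGTH:
        return False
    # Split on the last "@" so the alias is everything before the domain.
    alias, separator, _domain = email.rpartition("@")
    if not separator:
        return False  # no "@" at all, so this is not an address
    return 0 < len(alias) <= MAX_ALIAS_LENGTH
```

The check runs only on strings the main Regex has already captured, so the pattern itself stays untouched.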

troyhunt commented 9 months ago

Love it! All for killing unnecessary regex complexity. My only hesitation is that other places in the system that need to apply the same logic will need additional code, not just a regex update. But that's less important than actually making stuff work now 🙂

GStefanowich commented 9 months ago

What places in the system do you mean could also use the same logic?

The current design is to use one all-encompassing, loose-on-strictness Regex query, and then filter out the bad results using filters (possibly including additional, stricter sub-Regex queries).

The more strict (or complex) the main Regex, the more cycles are run on the raw content, which brings the old performance issues back up.
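As a rough illustration of that design (a sketch in Python, not the project's code), the loose pattern scans the raw content once and the stricter checks only ever see the short candidate strings. Both patterns and every name below are hypothetical.

```python
import re
from typing import Callable, Iterable, Iterator

# Deliberately permissive main pattern (hypothetical, not the project's Regex).
LOOSE_EMAIL = re.compile(r"[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}")

# Stricter sub-pattern applied only to the domain of an already-captured match:
# each label must start and end with a letter or digit.
STRICT_DOMAIN = re.compile(
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}"
)

Filter = Callable[[str], bool]


def no_consecutive_dots(email: str) -> bool:
    """Reject candidates such as bad..name@example.com."""
    return ".." not in email


def has_strict_domain(email: str) -> bool:
    """Run the stricter sub-regex on the domain part only."""
    domain = email.rpartition("@")[2]
    return STRICT_DOMAIN.fullmatch(domain) is not None


def extract(text: str, filters: Iterable[Filter]) -> Iterator[str]:
    """Yield loose matches that pass every filter."""
    checks = list(filters)
    for match in LOOSE_EMAIL.finditer(text):
        candidate = match.group(0)
        if all(check(candidate) for check in checks):
            yield candidate


sample = "mail a@example.com, bad..name@example.com or x@-bad-.example.com"
found = list(extract(sample, [no_consecutive_dots, has_strict_domain]))
```

Adding a new rule here means appending another function to the list rather than editing the main pattern.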


Not that I'm not all for encouraging new contributors to try and take a crack at things

The filter system also allows new contributors to try out entirely new Regex queries for catching emails from raw input without breaking hardline rules.

troyhunt commented 9 months ago

Other places include:

  1. When signing up for individual notifications
  2. When signing up for domain notifications
  3. When searching for a specific email address

I'm not overly precious about it needing to be regex though, I just need to maintain the logic across other parts of the system that aren't already OSS.