HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License
68 stars 23 forks

Email addresses in URLs should be extracted #63

Closed troyhunt closed 2 months ago

troyhunt commented 1 year ago

Just processing a breach with a bunch of data that looks like this:

[image: extractor output]

That's the output after running the extractor. The problem is that the string "https://example.com/path/test@example.com" yields "//example.com/path/test@example.com" as the email address. I think this is another one of those cases where, regardless of the spec, a forward slash should be treated as a word break; I just can't think of a legitimate case where there'd be a forward slash in a real email address.

There's now a failing test for this in da5b1f265a4fc6e8437509d92b13a057e29a8de1.

GStefanowich commented 1 year ago

This should be a simple change by removing the / from the Regex ('+/=? to '+=?)

https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/da5b1f265a4fc6e8437509d92b13a057e29a8de1/src/AddressExtractor.cs#L15
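For anyone following along, here is a rough sketch of why dropping the / from the local-part character class changes the result. It is written in Python rather than the project's C#, and the patterns are simplified stand-ins, not the actual regex in AddressExtractor.cs. The names LOCAL_WITH_SLASH, LOCAL_WITHOUT_SLASH, DOMAIN and extract are made up for illustration.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# The real regex in AddressExtractor.cs is more involved.
LOCAL_WITH_SLASH = r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+"    # "/" allowed in the local part
LOCAL_WITHOUT_SLASH = r"[A-Za-z0-9.!#$%&'*+=?^_`{|}~-]+"  # "/" removed, so it acts as a break
DOMAIN = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"


def extract(text, local_part):
    """Return every substring of text that matches local_part@DOMAIN."""
    return re.findall(local_part + "@" + DOMAIN, text)


url = "https://example.com/path/test@example.com"

# With "/" allowed, the match starts right after the ":" and swallows the URL path.
print(extract(url, LOCAL_WITH_SLASH))

# Without "/", the match can only start after the last slash.
print(extract(url, LOCAL_WITHOUT_SLASH))
```

In this sketch the first call returns the "//example.com/path/test@example.com" string from the report and the second returns just "test@example.com". The trade-off is that addresses with a literal slash in the local part, which the spec permits, would no longer be extracted whole.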

Johno-ACSLive commented 2 months ago

@troyhunt I was testing this out and it seems to parse correctly. Is this issue still valid?

troyhunt commented 2 months ago

Yep, you're right, it looks good now. Thanks for the heads up, I'll close this issue now.