HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Exclude common file extensions that could be mistaken for TLDs #40

Closed by troyhunt 1 year ago

troyhunt commented 1 year ago

Processing a breach just now, the last "email address" in the corpus was _v-16x9@2dxl_-77ed5d09bafd4e3cf6a5a0264e5e16ea35f14925.jpg which is obviously not a valid address as ".jpg" is not a valid TLD: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
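As a rough illustration of that check, here is a minimal Python sketch that parses text in the format of the IANA list (one TLD per line, with `#` comment lines) and looks up the final label of the address. The function names `parse_tld_list` and `tld_of` and the abbreviated sample text are assumptions made for illustration; they are not part of this project.

```python
def parse_tld_list(text):
    """Parse IANA-style TLD text: one TLD per line, '#' lines are comments."""
    tlds = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tlds.add(line.lower())
    return tlds


def tld_of(address):
    """Return the text after the final dot, lowercased."""
    return address.rsplit(".", 1)[-1].lower()


# Abbreviated stand-in for the real list, which is far longer.
SAMPLE_TLD_TEXT = """# sample header comment
COM
NET
ORG
"""

tlds = parse_tld_list(SAMPLE_TLD_TEXT)
candidate = "_v-16x9@2dxl_-77ed5d09bafd4e3cf6a5a0264e5e16ea35f14925.jpg"
is_valid_tld = tld_of(candidate) in tlds  # "jpg" is not in the list
```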

I've added a failing test for this specific case in https://github.com/HaveIBeenPwned/EmailAddressExtractor/commit/9c588fdf90bb296fdb7e34fce3de798cf9848f42. I know this will never be perfect, but given breaches sometimes contain complete website dumps or references to corporate docs, I think a large number of these false positives could be avoided quite easily. I suggest the following and am open to any other thoughts on it:

  1. Create a list of the most common file extensions you'd see in breaches (.jpg, .sql, .txt, .html, .docx, etc.)
  2. Check that none of them exist in the IANA list above
  3. Either add them to configuration or statically code them as we've done with FileExtensionParsing
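The three steps above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the project's implementation: the names `COMMON_FILE_EXTENSIONS`, `extensions_safe_to_exclude` and `has_file_extension_tld` are hypothetical, and the extension set contains only the examples named in step 1.

```python
# Step 1: hypothetical deny-list, seeded with the examples from this issue.
COMMON_FILE_EXTENSIONS = {"jpg", "sql", "txt", "html", "docx"}


def extensions_safe_to_exclude(candidates, iana_tlds):
    """Step 2: drop any candidate extension that is also a real TLD."""
    tlds = {t.lower() for t in iana_tlds}
    return {e.lower() for e in candidates} - tlds


def has_file_extension_tld(address, excluded):
    """Step 3: True when the domain's final label is an excluded extension."""
    domain = address.rsplit("@", 1)[-1]
    if "." not in domain:
        return False
    last_label = domain.rsplit(".", 1)[-1].lower()
    return last_label in excluded


# A tiny stand-in for the IANA list; the real list is much longer.
sample_tlds = ["COM", "NET", "ZIP"]
excluded = extensions_safe_to_exclude(COMMON_FILE_EXTENSIONS | {"zip"}, sample_tlds)
```

Step 2 matters because some file extensions are also delegated TLDs. `.zip` is one of them, so it is removed from the exclusion set here and addresses ending in `.zip` would still be accepted.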

To be clear, this is only an issue when the file extension appears in the same string as other valid email characters separated by an @ symbol, so it's definitely not major. It does seem like an easy win, though: it addresses a specific kind of false positive that otherwise filters through into a loaded data breach.