While processing a breach just now, the last "email address" in the corpus was `_v-16x9@2dxl_-77ed5d09bafd4e3cf6a5a0264e5e16ea35f14925.jpg`, which is obviously not a valid address because ".jpg" is not a valid TLD: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
I've added a failing test for this specific case in https://github.com/HaveIBeenPwned/EmailAddressExtractor/commit/9c588fdf90bb296fdb7e34fce3de798cf9848f42. I know this will never be perfect, but given that breaches sometimes contain complete website dumps or references to corporate docs, I think a large number of these false positives could be avoided quite easily. I propose the following and am open to any other thoughts on it:
1. Create a list of the most common file extensions you'd see in breaches (.jpg, .sql, .txt, .html, .docx, etc.)
2. Check that none of them exist in the IANA list above
3. Either add them to configuration or statically code them as we've done with FileExtensionParsing
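
As a rough illustration of the first two steps, here is a minimal Python sketch (not the project's code). The names `parse_tld_list`, `split_extensions` and `SAMPLE_TLDS` are hypothetical, and the sample is a tiny hand-written excerpt in the same format as the IANA file (one TLD per line, `#` for comments) rather than the real list. It shows why the check matters: some common extensions, such as .zip and .mov, are real TLDs and must not be blocked.

```python
def parse_tld_list(text: str) -> set[str]:
    """Parse the IANA list format: one TLD per line, '#' lines are comments."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    }


def split_extensions(candidates, tlds):
    """Split candidates into (safe, colliding); colliding ones are also real TLDs."""
    normalized = [ext.lower().lstrip(".") for ext in candidates]
    safe = sorted(e for e in normalized if e not in tlds)
    colliding = sorted(e for e in normalized if e in tlds)
    return safe, colliding


# Hypothetical excerpt for illustration only; the real file is far longer.
SAMPLE_TLDS = """# illustrative excerpt, not the real IANA file
COM
MOV
NET
ZIP
"""

candidates = [".jpg", ".sql", ".txt", ".html", ".docx", ".zip", ".mov"]
safe, colliding = split_extensions(candidates, parse_tld_list(SAMPLE_TLDS))
print("safe to block:", safe)
print("also valid TLDs, do not block:", colliding)
```

Only the "safe" set would go into configuration or the static list; anything in the colliding set needs to stay allowed.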
To be clear, this is only an issue when a file extension appears in the same string as other valid email characters separated by an @ symbol, so it's definitely not major. Still, it seems like an easy win that stops this specific kind of false positive from filtering through into a loaded data breach.
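
For discussion, a minimal Python sketch of the filter itself (not the project's code). `BLOCKED_EXTENSIONS` and `has_file_extension_tld` are hypothetical names, and the set holds only the example extensions from this issue, assuming each has already been confirmed absent from the IANA list.

```python
# Hypothetical blocklist: just the example extensions named in this issue,
# assumed to have been checked against the IANA TLD list beforehand.
BLOCKED_EXTENSIONS = frozenset({"jpg", "sql", "txt", "html", "docx"})


def has_file_extension_tld(address: str) -> bool:
    """Return True if the part after the last '.' of the domain is a blocked extension."""
    # Everything after the last '@' is treated as the domain.
    domain = address.rpartition("@")[2]
    # The final dot-separated label is where a real TLD would sit.
    # If the domain has no '.', the whole domain is compared.
    last_label = domain.rpartition(".")[2].lower()
    return last_label in BLOCKED_EXTENSIONS


candidate = "_v-16x9@2dxl_-77ed5d09bafd4e3cf6a5a0264e5e16ea35f14925.jpg"
print(has_file_extension_tld(candidate))
print(has_file_extension_tld("user@example.com"))
```

The comparison is case-insensitive and only looks at the last label, so a file name appearing elsewhere in the string is unaffected.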