Open troyhunt opened 1 year ago
Those are all completely valid characters to be used in email addresses, per RFC 5322. The only requirement for the local part of the address with respect to the first character is that it MUST NOT be a "." (period).
That said, I am not really aware of any providers which permit this as an address or alias, so.. perhaps it's OK to drop them?
I might be inclined to include both stripped and non-stripped.
They are valid characters, but most email address validation will fail them.
I believe only the . is restricted from being the first or last character of the local-part, so I think you've gotta take the weirdos.
"Without quotes, local-parts may consist of any combination of alphabetic characters, digits, or any of the special characters
! # $ % & ' * + - / = ? ^ _ ` . { | } ~
period (".") may also appear, but may not be used to start or end the local part, nor may two or more consecutive periods appear."
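As a rough illustration of the rule quoted above, here is a minimal Python sketch that checks an unquoted local part. The character set is copied from the quote; the function name `is_valid_unquoted_local_part` is made up for this example and is not part of any library or of this project.

```python
import re

# Characters allowed in an unquoted local part, per the passage quoted above.
ATEXT = r"A-Za-z0-9!#$%&'*+\-/=?^_`{|}~"

# One or more runs of allowed characters, separated by single periods.
# This shape rules out a leading period, a trailing period and doubled periods.
DOT_ATOM = re.compile(rf"[{ATEXT}]+(?:\.[{ATEXT}]+)*")


def is_valid_unquoted_local_part(local: str) -> bool:
    """Return True if `local` fits the unquoted local-part rule (hypothetical helper)."""
    return DOT_ATOM.fullmatch(local) is not None
```

Under this rule `{john.doe` passes, while `.john`, `john.` and `jo..hn` do not.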
Is there a pattern in the domain portion of the address? If it's all some set of custom domains, maybe they are valid. If it's gmail/yahoo/hotmail/outlook check the service's sign up rules, where those names are likely invalid.
Those are all completely valid characters to be used in email addresses, per RFC 5322.
Perhaps, but I suspect that in this case those chars aren't a valid part of the address and are instead delimiters that have been inadvertently included. That leaves us with trading off between either strictly adhering to the RFC and excluding those addresses, or parsing them out and including them despite the strict RFC definition. Which is better?
I might be inclined to include both stripped and non-stripped.
That gets very messy as it doubles up on the addresses which then affects all sorts of counts and stats. I'm really reticent to do that and would prefer to pick one approach over the other.
Is there a pattern in the domain portion of the address? If it's all some set of custom domains, maybe they are valid. If it's gmail/yahoo/hotmail/outlook check the service's sign up rules, where those names are likely invalid.
No, lots of different domains, I'm highly confident we're looking at parsing issues, not deliberately funky addresses.
I just sent myself an email starting with { from Gmail to Exim. All of the servers involved (including gmail, exim, rspamd, and dovecot) had no issues whatsoever.
I'd also advocate for adhering to RFC 3696 wherever possible. I've seen issues when systems reject perfectly legitimate but rarely seen email addresses. One of my emails has the super-short format of x@x.yy. It's my real email, but is often rejected by web form validation not respecting the RFC.
For good measure, I also tested with Exchange (MS365) and Roundcube. No issues with either one. It seems like a definite edge case, but still worthwhile to allow all valid email addresses.
If my email address is john.doe@example.com, I would like to be alerted if {john.doe@example.com appears in a data breach. That said, there is the tiny possibility that {john.doe@example.com is the correct address, so I think including both versions would be the best option.
Even though these are valid characters, I feel more often than not they are probably going to be delimiters. If including them means I won't be alerted on a@b.c when you hit {a@b.c it'd be best to strip them.
I do think the question again becomes: should you handle this in this tool or somewhere else in the HIBP infrastructure?
I’m inclined to rephrase this: is it better to include characters that are almost certainly intended to be delimiters, therefore excluding the correct address from HIBP, or strip the characters and include the correct address regardless of the RFC allowing them?
👆This. I think we have about five nines certainty that these are other people's failure to parse lists properly.
Do the email providers for these addresses support creating an address with the leading character? If not, then as written they could not possibly be correct and you can strip the character without fear.
That leaves us with trading off between either strictly adhering to the RFC and excluding those addresses, or parsing them out and including them despite the strict RFC definition. Which is better?
Timely, I've just been asking similar questions on email around "do I respect the RFCs or do I act sensibly / force others to act sensibly". Before doing research I was of the opinion of "force them to act sensibly".
Using https://mailchimp.com/resources/most-used-email-service-providers/ as a reference.
tldr: strip them
Given that none of the above major email providers let you sign up using these characters, I think it is a fair assumption that these are failed delimiters. In the (probably rare) case that it is not a failed delimiter then you should have been sensible and unfortunately you will miss out on this HIBP notification.
That said, I think there might be cases where it legitimately appears later in the address (i.e. where email providers allow the <mailbox>(+<label>)@<domain> format).
Aside: this ends up being a rabbit hole if you try to normalise it, as some providers (but not all) let you use . in the address and strip it out when determining the actual mailbox to deliver to.
The RFC basically says it's up to the provider how they handle it, so unless you are going to start checking MX records or probing SMTP server headers you can't really make any assumptions about what you can and can't normalise in an email address for sending.
The sending part is important in this context because (I assume) it will be used for notifications. Contrast that with searching on the web UI where you could have an index of normalised email addresses to help users who have taken advantage of things like . stripping or + labels from their email provider.
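To make the idea of a normalised search index a bit more concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `RULES` table, the domains in it and the function name `normalise_for_search` are hypothetical, and real provider behaviour would need to be verified per provider. Unknown domains are left untouched, in line with the point that you can't assume what a provider normalises.

```python
# Hypothetical per-domain rules; not taken from any provider's documentation.
RULES = {
    "gmail.com": {"strip_dots": True, "label_sep": "+"},
    "example.org": {"strip_dots": False, "label_sep": "+"},
}


def normalise_for_search(address: str) -> str:
    """Return a search key for `address`; never use the result for sending mail."""
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    rule = RULES.get(domain)
    if rule is None:
        # Unknown provider: make no assumptions about the local part.
        return f"{local}@{domain}"
    if rule["label_sep"]:
        # Drop everything from the first label separator onwards.
        local = local.split(rule["label_sep"], 1)[0]
    if rule["strip_dots"]:
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

With these made-up rules, `john.doe+news@gmail.com` maps to `johndoe@gmail.com`, while an address at an unlisted domain comes back unchanged apart from the lowercased domain.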
That gets very messy as it doubles up on the addresses which then affects all sorts of counts and stats. I'm really reticent to do that and would prefer to pick one approach over the other.
If there is a concern about users not receiving notifications could we (and by we I mean @troyhunt):
That was a really useful set of screen caps @nhairs, thank you!
I think the only reasonable conclusion here is that due to the lack of support for these chars by mainstream email providers, the unlikelihood of anyone legitimately using them and the real world impact of an email address being missed, the chars should be stripped and the RFC be damned 🙂
That now makes this issue a feature request. I'll add some tests for this later, I need to go back and analyse precisely how those strings were formatted originally.
You might consider implementing a feature switch that looks like --mode rfc3696 where this would allow users to have their RFC3696 strictness when wanted; setting it up this way makes it possible to implement other "modes" at a later time too.
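A minimal sketch of what such a switch could look like, written in Python with `argparse` purely for illustration (the tool itself is not written in Python). Only the `--mode rfc3696` spelling comes from the suggestion above; the `practical` mode name and the `build_parser` function are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser with a hypothetical --mode switch."""
    parser = argparse.ArgumentParser(description="Extract email addresses (sketch)")
    parser.add_argument(
        "--mode",
        choices=["practical", "rfc3696"],  # "practical" is a made-up default name
        default="practical",
        help="how strictly to follow the RFC when deciding what counts as an address",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"running in {args.mode} mode")
```

Using `choices` keeps the door open for further modes later: adding one is a single new list entry plus whatever rules it selects.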
I've just added a failing test for this (EmailAddressesInSingleQuotesWithTrailingSpaceIsExtracted) based on the following in the most recent breach I loaded:
This subsequently extracted an address that began with a single quote and ended with the correct .br TLD. Let's add single quotes to the list of chars in the issue above and get this done.
Related: You may want to check for emails in the form of "asdf"@example.com (and potentially the single quote version).
These are valid emails according to the RFCs (though I'm not super clear on where they can still be used).
For more info I'd check out the RFCs in the comments for quoted_local_part in this library
Related: You may want to check for emails in the form of "asdf"@example.com (and potentially the single quote version).
I've just updated the readme with a "practical considerations" section that addresses this: https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/main/README.md
I've just committed https://github.com/HaveIBeenPwned/EmailAddressExtractor/commit/cafb50344484f38ff46c9aef3592c718862c2ada which is a kludgy fix for some common problems I was seeing. I've then added https://github.com/HaveIBeenPwned/EmailAddressExtractor/commit/e73dd180dc1696292f767c24fa0a088d44956df8 which caused a heap of dramas processing the (still alleged) AT&T breach to the point where I had to defer to the old script.
This all looks related to problems with the delimiters and my view on it now is that we just take a strict approach. Would be awesome if someone took a good shot at fixing this.
Why not use a regex? Emailregex.com says, based on RFC 5322, that some of the characters you are replacing now are valid characters.
Besides that, a string.Replace replaces all the characters in the given text, not sure if you would want that.
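To illustrate the difference, here is a short Python sketch contrasting a global replace with stripping only at the edges. Python's `str.replace` stands in for `string.Replace` here, and the `JUNK` character set is a made-up example, not the list used in the commit.

```python
# Hypothetical set of delimiter-like characters, for illustration only.
JUNK = "{}'\"<>()[]"


def replace_everywhere(text: str) -> str:
    """Remove every junk character, wherever it appears in the text."""
    for ch in JUNK:
        text = text.replace(ch, "")
    return text


def strip_edges(text: str) -> str:
    """Remove junk characters only from the start and end of the text."""
    return text.strip(JUNK)
```

For the input `{o'brien@example.com}`, `replace_everywhere` returns `obrien@example.com` (a different address), whereas `strip_edges` returns `o'brien@example.com`.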
I've gotta draw a line under this work and implement changes to make this app usable. So, screw the RFC, let's boil it down to the absolute basics and define a list of characters that can appear anywhere in the alias:
a-z
A-Z
0-9
_
-
+
And additional alias rules:
. may not appear in the first or last position and also may not appear consecutively (i.e. 2 periods side by side)
Going back to @nhairs's analysis of major mail providers, none of them support characters that aren't included in the list above. By the time you take out Gmail, Outlook / Hotmail, Yahoo, you're not left with much. By example, the MediaWorks breach loaded a few days ago had 163k breached addresses and 120k were on those mail providers alone. In that data set was a grand total of 4 aliases that were exceptions to the above criteria:
3 of those were on the mail providers mentioned above that explicitly disallow those strings in the address and the 4th was on xtra.co.nz (I don't know if they permit those characters or not). Even if every single one of those was actually legitimate and incorrectly got rejected, we're looking at a 0.002% false-positive rate.
Can anyone identify any practical exceptions to these rules? By "practical" I mean characters that are commonly used and broadly acceptable by both email providers and websites. I want to reiterate that the sole purpose of this project is to extract email addresses from strings in a data breach; it's far more likely that a valid address is surrounded by junk than it is that an obscure RFC-compliant character is part of a legitimate address.
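For anyone wanting to experiment with the rules proposed in this comment, here is a minimal Python sketch of an extractor built on them. The alias pattern follows the character list and the period rules as stated; the domain pattern is an assumption (the comment doesn't define one), and the names `ALIAS`, `DOMAIN` and `extract` are hypothetical rather than anything in the project.

```python
import re

# Alias characters from the list in the comment: a-z, A-Z, 0-9, _, -, +
ALIAS_CHARS = r"A-Za-z0-9_+\-"

# Runs of alias characters separated by single periods:
# no leading, trailing or consecutive periods.
ALIAS = rf"[{ALIAS_CHARS}]+(?:\.[{ALIAS_CHARS}]+)*"

# Assumed domain shape: two or more labels of letters, digits and inner hyphens.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?"
DOMAIN = rf"{LABEL}(?:\.{LABEL})+"

EMAIL = re.compile(rf"{ALIAS}@{DOMAIN}")


def extract(text: str) -> list[str]:
    """Return every substring of `text` that matches the strict pattern."""
    return EMAIL.findall(text)
```

Because the pattern searches inside the string, surrounding junk is dropped rather than causing a rejection: `extract("{john.doe@example.com}")` returns `["john.doe@example.com"]`. One side effect to be aware of is that an input such as `jo..hn@example.com` yields `hn@example.com`, since matching restarts after the doubled period.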
@troyhunt for subaddressing, maybe + and - (- is less common for subaddressing, but may be used for other purposes like "john-doe@") should also be included. (Reference)
Going back to @nhairs's analysis of major mail providers, none of them support characters that aren't included in the list above.
Probably a small corner case, but you can make an alias that starts with { on Google Workspace and send/receive using it, using your own domain.
@troyhunt for subaddressing, maybe + and - (- is less common for subaddressing, but may be used for other purposes like "john-doe@") should also be included. (Reference)
Ugh, markdown somehow cannibalised my original points 5 and 6, the dash character is already included. Doesn't matter whether it's used for subaddressing or not, it's allowed.
Going back to @nhairs's analysis of major mail providers, none of them support characters that aren't included in the list above.
Probably a small corner case, but you can make an alias that starts with { on Google Workspace and send/receive using it, using your own domain.
Yuck! Will keep it in mind, let's see how the rest of the feedback goes but yeah, I'm feeling that's far more likely to be junk than a legitimate address.
So, you can also start an alias with . on Google Workspace... and at least within Google, you can send using it. Attempting to reply to it errors out however.
Haven't tested sending to or receiving from outside Google.
I did very similar to @nhairs above, I used to keep a service provider set of rules. When processing I'd look at the domain and apply the standardization rules for that domain. Unrecognized domains were cause for an alert (after a while you've seen all the main ones). On the list, but never completed, was to do an MX lookup for vanity domains. Since we were in marketing, the decision was made that if we overscrubbed an address from some really minor provider, oh well.
I am strongly in favor of the RFC being reviewed, IIRC technically upper and lower case should be treated differently.
https://www.jochentopf.com/email/chars.html has some considerations around this, coming to pretty much the same conclusion as above.
I've seen = being used for SRS though this perhaps isn't very relevant here.
https://www.jochentopf.com/email/chars.html has some considerations around this, coming to pretty much the same conclusion as above.
That's a great reference @The-Compiler and if you just take the "OK" characters, it aligns perfectly with my conclusions.
This is more of an open question than something I think we should immediately do:
I've just run this across a breach I'm loading now that has about 2.2M addresses found. Over 1k of them at the end of the file begin with one of the following characters:
My gut feel is that these chars should be stripped and there are no valid use cases where they should legitimately exist at the beginning of the address (or probably anywhere in the address). Just by way of example:
I'm more inclined to strip these than include them, what do we all think?