Should an email address be able to begin with these characters?

troyhunt commented 1 year ago

This is more of an open question than something I think we should immediately do:

I've just run this across a breach I'm loading now that has about 2.2M addresses found. Over 1k of them at the end of the file begin with one of the following characters:

\
/
`
'
{
}
|
!
~

My gut feel is that these chars should be stripped and that there are no valid use cases where they should legitimately exist at the beginning of the address (or probably anywhere in the address). Just by way of example:

I'm more inclined to strip these than include them, what do we all think?
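To make the proposal concrete, here is a minimal sketch of stripping the listed characters from the start of an address. It is written in Python purely for illustration and is not the project's code; the names SUSPECT_LEADING and strip_leading_delimiters are hypothetical.

```python
# Characters listed in this issue as suspected stray delimiters.
# "\\" is a single backslash in Python source.
SUSPECT_LEADING = "\\/`'{}|!~"


def strip_leading_delimiters(address: str) -> str:
    """Remove any run of suspect characters from the start of the address.

    Only the leading run is removed; the same characters appearing later
    in the address are left untouched.
    """
    return address.lstrip(SUSPECT_LEADING)
```

With this sketch, {john.doe@example.com becomes john.doe@example.com, while an address that does not start with one of the listed characters is returned unchanged.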

NeilHanlon commented 1 year ago

Those are all completely valid characters to be used in email addresses, per RFC 5322. The only requirement for the local part of the address with respect to the first character is that it MUST NOT be a period (.).

That said, I am not really aware of any providers which permit this as an address or alias, so.. perhaps it's OK to drop them?

junderw commented 1 year ago

I might be inclined to include both stripped and non-stripped.

They are valid characters, but most email address validation will fail them.

djcrabhat commented 1 year ago

I believe only the . is restricted from being the first or last character of the local-part, so I think you've gotta take the weirdos.

"Without quotes, local-parts may consist of any combination of alphabetic characters, digits, or any of the special characters

! # $ % & ' * + - / = ? ^ _ ` . { | } ~

period (".") may also appear, but may not be used to start or end the local part, nor may two or more consecutive periods appear."

https://datatracker.ietf.org/doc/html/rfc3696#section-3
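The quoted rule for unquoted local-parts can be sketched as a small check. This is an illustrative Python sketch, not the extractor's code; the function name is hypothetical, and it deliberately ignores quoted local-parts and length limits.

```python
import string

# Special characters from the RFC 3696 section 3 passage quoted above.
# The period is handled separately because it has positional rules.
SPECIALS = "!#$%&'*+-/=?^_`{|}~"
ALLOWED = set(string.ascii_letters + string.digits + SPECIALS + ".")


def is_valid_unquoted_local_part(local: str) -> bool:
    """Check an unquoted local-part against the quoted rule only."""
    if not local:
        return False
    # Every character must be a letter, digit, listed special, or period.
    if any(ch not in ALLOWED for ch in local):
        return False
    # A period may not start or end the local-part.
    if local.startswith(".") or local.endswith("."):
        return False
    # Two or more consecutive periods are not allowed.
    return ".." not in local
```

Under this check a local-part such as {john.doe passes, which is the point made above: by the RFC wording, the "weirdos" are valid.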

nemec commented 1 year ago

Is there a pattern in the domain portion of the address? If it's all some set of custom domains, maybe they are valid. If it's gmail/yahoo/hotmail/outlook check the service's sign up rules, where those names are likely invalid.

troyhunt commented 1 year ago

Those are all completely valid characters to be used in email addresses, per RFC 5322.

Perhaps, but I suspect that in this case those chars aren't a valid part of the address and are instead delimiters that have been inadvertently included. That leaves us with trading off between either strictly adhering to the RFC and excluding those addresses, or parsing them out and including them despite the strict RFC definition. Which is better?

troyhunt commented 1 year ago

I might be inclined to include both stripped and non-stripped.

That gets very messy as it doubles up on the addresses which then affects all sorts of counts and stats. I'm really reticent to do that and would prefer to pick one approach over the other.

troyhunt commented 1 year ago

Is there a pattern in the domain portion of the address? If it's all some set of custom domains, maybe they are valid. If it's gmail/yahoo/hotmail/outlook check the service's sign up rules, where those names are likely invalid.

No, lots of different domains, I'm highly confident we're looking at parsing issues, not deliberately funky addresses.

thompsonus commented 1 year ago

I just sent myself an email starting with { from Gmail to Exim. All of the servers involved (including gmail, exim, rspamd, and dovecot) had no issues whatsoever.

I'd also advocate for adhering to RFC 3696 wherever possible. I've seen issues when systems reject perfectly legitimate but rarely seen email addresses. One of my emails has the super-short format of x@x.yy. It's my real email, but is often rejected by web form validation not respecting the RFC.

thompsonus commented 1 year ago

For good measure, I also tested with Exchange (MS365) and Roundcube. No issues with either one. It seems like a definite edge case, but still worthwhile to allow all valid email addresses.

oxguy3 commented 1 year ago

If my email address is john.doe@example.com, I would like to be alerted if {john.doe@example.com appears in a data breach. That said, there is the tiny possibility that {john.doe@example.com is the correct address, so I think including both versions would be the best option.

jaimevisser commented 1 year ago

Even though these are valid characters, I feel more often than not they are probably going to be delimiters. If including them means I won't be alerted on a@b.c when you hit {a@b.c it'd be best to strip them.

I do think the question again becomes; should you handle this in this tool or somewhere else in the HIBP infrastructure?

kingthorin commented 1 year ago

I’m inclined to rephrase this: is it better to include characters that are almost certainly intended to be delimiters, therefore excluding the correct address from HIBP, or strip the characters and include the correct address regardless of the RFC allowing them?

👆This. I think we have about five nines certainty that these are other people's failure to parse lists properly.

accidentaldeveloper commented 1 year ago

Do the email providers for these addresses support creating an address with the leading character? If not, then as written they could not possibly be correct and you can strip the character without fear.

nhairs commented 1 year ago

That leaves us with trading off between either strictly adhering to the RFC and excluding those addresses, or parsing them out and including them despite the strict RFC definition. Which is better?

Timely, I've just been asking similar questions on email around "do I respect the RFCs or do I act sensibly / force others to act sensibly". Before doing research I was of the opinion of "force them to act sensibly".

Do the email providers for these addresses support creating an address with the leading character?

Using https://mailchimp.com/resources/most-used-email-service-providers/ as a reference.

Google: No

Yahoo: No

Outlook: No

AOL: No

Proton: No

Apple: No

Mail.com: No

GMX: No

Zoho: No

Personal Conclusion

tldr: strip them

Given that none of the above major email providers let you sign up using these characters, I think it is a fair assumption that these are failed delimiters. In the (probably rare) case that it is not a failed delimiter then you should have been sensible and unfortunately you will miss out on this HIBP notification.

That said, I think there might be cases where it legitimately appears later in the address (i.e. where email providers allow the <mailbox>(+<label>)@<domain> format).

Aside: this ends up being a rabbit hole if you try to normalise it, as some providers let you use . in the address, which is stripped out when determining the actual mailbox to deliver to - but not all providers. The RFC basically says it's up to the provider how they handle it, so unless you are going to start checking MX records or probing SMTP server headers you can't really make any assumptions about what you can and can't normalise in an email address for sending.

The sending part is important in this context because (I assume) it will be used for notifications. Contrast that with searching on the web UI where you could have an index of normalised email addresses to help users who have taken advantage of things like . stripping or + labels from their email provider.
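A normalised search index of the kind described could look roughly like the sketch below. It is Python for illustration only; the RULES table and function name are hypothetical, and the single Gmail entry is an assumed example of a provider that ignores dots and + labels, not a vetted rule set. The input is assumed to contain an @.

```python
# Hypothetical per-domain rules: (strip dots, strip "+label").
# Domains not listed are left alone, since the provider decides how
# the local part is interpreted.
RULES = {
    "gmail.com": (True, True),
}


def normalise_for_search(address: str) -> str:
    """Build a lookup key for searching; never use it for sending."""
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    strip_dots, strip_label = RULES.get(domain, (False, False))
    if strip_label:
        # Keep only the mailbox part before the first "+".
        local = local.split("+", 1)[0]
    if strip_dots:
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

The raw address would still be stored for notifications; only the search key is normalised, which keeps the "can't assume anything when sending" constraint intact.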

Questions

That gets very messy as it doubles up on the addresses which then affects all sorts of counts and stats. I'm really reticent to do that and would prefer to pick one approach over the other.

If there is a concern about users not receiving notifications could we (and by we I mean @troyhunt):

check if there are any subscribed emails that have such special characters
if there are funky emails, send notifications using both the stripped and raw emails, but only add the stripped emails to the database

troyhunt commented 1 year ago

That was a really useful set of screen caps @nhairs, thank you!

I think the only reasonable conclusion here is that due to the lack of support for these chars by mainstream email providers, the unlikelihood of anyone legitimately using them and the real world impact of an email address being missed, the chars should be stripped and the RFC be damned 🙂

That now makes this issue a feature request. I'll add some tests for this later, I need to go back and analyse precisely how those strings were formatted originally.

ndejong commented 1 year ago

You might consider implementing a feature switch that looks like --mode rfc3696 where this would allow users to have their RFC3696 strictness when wanted; setting it up this way makes it possible to implement other "modes" at a later time too.
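As a rough illustration of the suggested switch, here is a Python argparse sketch. Only the --mode rfc3696 spelling comes from the comment above; the "practical" mode name, the default, and the parser itself are assumptions, and this is not the tool's actual command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI showing how a strictness mode could be selected."""
    parser = argparse.ArgumentParser(description="Hypothetical extractor CLI")
    parser.add_argument(
        "--mode",
        choices=["practical", "rfc3696"],  # "practical" is an assumed name
        default="practical",
        help="which set of address rules to apply",
    )
    return parser
```

Using choices means further modes can be added later by extending the list, which is the extensibility the suggestion is after.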

troyhunt commented 9 months ago

I've just added a failing test for this (EmailAddressesInSingleQuotesWithTrailingSpaceIsExtracted) based on the following in the most recent breach I loaded:

This subsequently extracted an address that began with a single quote and ended with the correct .br TLD. Let's add single quotes to the list of chars in the issue above and get this done.

nhairs commented 9 months ago

Related: You may want to check for emails in the form of "asdf"@example.com (and potentially the single quote version).

These are valid emails according to the RFCs (though I'm not super clear on where they can still be used).

For more info I'd check out the RFCs in the comments for quoted_local_part in this library

troyhunt commented 9 months ago

Related: You may want to check for emails in the form of "asdf"@example.com (and potentially the single quote version).

I've just updated the readme with a "practical considerations" section that addresses this: https://github.com/HaveIBeenPwned/EmailAddressExtractor/blob/main/README.md

troyhunt commented 5 months ago

I've just committed https://github.com/HaveIBeenPwned/EmailAddressExtractor/commit/cafb50344484f38ff46c9aef3592c718862c2ada which is a kludgy fix for some common problems I was seeing. I've then added https://github.com/HaveIBeenPwned/EmailAddressExtractor/commit/e73dd180dc1696292f767c24fa0a088d44956df8 which caused a heap of dramas processing the (still alleged) AT&T breach to the point where I had to defer to the old script.

This all looks related to problems with the delimiters and my view on it now is that we just take a strict approach. Would be awesome if someone took a good shot at fixing this.

alberthoekstra commented 5 months ago

Why not use a regex? Emailregex.com says, based on RFC 5322, that some of the characters you are replacing now are valid characters.

Besides that, a string.Replace replaces all the characters in the given text, not sure if you would want that.
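The difference between replacing everywhere and trimming only the edges can be shown with a short sketch. This is Python for illustration, using str.replace and str.strip as stand-ins; it does not reproduce the commit being discussed, and both function names are hypothetical.

```python
def remove_everywhere(text: str, chars: str) -> str:
    """Delete every occurrence of each character, wherever it appears."""
    for ch in chars:
        text = text.replace(ch, "")
    return text


def trim_edges(text: str, chars: str) -> str:
    """Delete the characters only from the start and end of the text."""
    return text.strip(chars)
```

For the input 'o'brien@example.com' (wrapped in single quotes), remove_everywhere yields obrien@example.com, silently changing the address, while trim_edges yields o'brien@example.com.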

troyhunt commented 5 months ago

I've gotta draw a line under this work and implement changes to make this app usable. So, screw the RFC, let's boil it down to the absolute basics and define a list of characters that can appear anywhere in the alias:

a-z
A-Z
0-9
_
-
+

And additional alias rules:

. may not appear in the first or last position and also may not appear consecutively (i.e. 2 periods side by side)
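The character list and period rule above can be sketched as a single pattern. This is an illustrative Python sketch, not the project's implementation; ALIAS_RE and is_practical_alias are hypothetical names.

```python
import re

# One or more allowed characters, optionally followed by groups of
# ".<allowed characters>". This shape rules out a leading period, a
# trailing period, and consecutive periods without extra checks.
ALIAS_RE = re.compile(r"[A-Za-z0-9_+-]+(?:\.[A-Za-z0-9_+-]+)*")


def is_practical_alias(alias: str) -> bool:
    """Return True if the whole alias satisfies the rules listed above."""
    return ALIAS_RE.fullmatch(alias) is not None
```

Under this pattern john.doe+news is accepted, while {john, .john, john..doe and john. are all rejected.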

Going back to @nhairs's analysis of major mail providers, none of them support characters that aren't included in the list above. By the time you take out Gmail, Outlook / Hotmail, Yahoo, you're not left with much. By example, the MediaWorks breach loaded a few days ago had 163k breached addresses and 120k were on those mail providers alone. In that data set was a grand total of 4 aliases that were exceptions to the above criteria:

string&
string$string
string&
string|

3 of those were on the mail providers mentioned above that explicitly disallow those strings in the address and the 4th was on xtra.co.nz (I don't know if they permit those characters or not). Even if every single one of those was actually legitimate and incorrectly got rejected, we're looking at a 0.002% false-positive rate.
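The rate quoted above follows directly from the two figures given, treating "163k" as 163,000 (an assumption, since the exact count is not stated). A small Python sketch of the arithmetic, with a hypothetical function name:

```python
def exception_rate_percent(exceptions: int, total: int) -> float:
    """Share of addresses that break the proposed rules, as a percentage."""
    return exceptions / total * 100


# 4 exceptions out of roughly 163,000 addresses.
rate = exception_rate_percent(4, 163_000)
```

The result is a little under 0.0025%, consistent with the roughly 0.002% figure in the comment.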

Can anyone identify any practical exceptions to these rules? By "practical" I mean characters that are commonly used and broadly acceptable by both email providers and websites. I want to reiterate that the sole purpose of this project is to extract email addresses from strings in a data breach; it's far more likely that a valid address is surrounded by junk than it is that an obscure RFC-compliant character is part of a legitimate address.

terminalcommandnewsletter commented 5 months ago

@troyhunt for subaddressing, maybe + and - (- is less common for subaddressing, but may be used for other purposes like "john-doe@") should also be included. (Reference)

robrankin commented 5 months ago

Going back to @nhairs's analysis of major mail providers, none of them support characters that aren't included in the list above.

Probably a small corner case, but you can make an alias that starts with { on Google Workspace and send/receive using it, using your own domain.

troyhunt commented 5 months ago

@troyhunt for subaddressing, maybe + and - (- is less common for subaddressing, but may be used for other purposes like "john-doe@") should also be included. (Reference)

Ugh, markdown somehow cannibalised my original points 5 and 6, the dash character is already included. Doesn't matter whether it's used for subaddressing or not, it's allowed.

troyhunt commented 5 months ago

Going back to @nhairs's analysis of major mail providers, none of them support characters that aren't included in the list above.

Probably a small corner case, but you can make an alias that starts with { on Google Workspace and send/receive using it, using your own domain.

Yuck! Will keep it in mind, let's see how the rest of the feedback goes but yeah, I'm feeling that's far more likely to be junk than a legitimate address.

robrankin commented 5 months ago

So, you can also start an alias with . on Google Workspaces... and at least within Google, you can send using it. Attempting to reply to it errors out however.

Haven't tested sending to or receiving from outside Google.

rjdudley commented 5 months ago

I did something very similar to @nhairs above: I used to keep a set of rules per service provider. When processing I'd look at the domain and apply the standardization rules for that domain. Unrecognized domains were cause for an alert (after a while you've seen all the main ones). On the list, but never completed, was to do an MX lookup for vanity domains. Since we were in marketing, the decision was made that if we overscrubbed an address from some really minor provider, oh well.

I am strongly in favor of the RFC being reviewed; IIRC, technically upper and lower case should be treated differently.

The-Compiler commented 5 months ago

https://www.jochentopf.com/email/chars.html has some considerations around this, coming to pretty much the same conclusion as above.

I've seen = being used for SRS though this perhaps isn't very relevant here.

troyhunt commented 5 months ago

https://www.jochentopf.com/email/chars.html has some considerations around this, coming to pretty much the same conclusion as above.

That's a great reference @The-Compiler and if you just take the "OK" characters, it aligns perfectly with my conclusions.
