Create a standard email field verification Regular Expression (or find and verify one)

coreyshuman commented 6 years ago

https://en.wikipedia.org/wiki/Email_address

There are some crazy email addresses allowed in RFC 5321 and RFC 5322. Here is the above article's set of rules, along with examples of valid and invalid addresses.

Local-part

The local-part of the email address may use any of these ASCII characters:

uppercase and lowercase Latin letters A to Z and a to z;
digits 0 to 9;
special characters !#$%&'*+-/=?^_`{|}~;
dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. John..Doe@example.com is not allowed but "John..Doe"@example.com is allowed);

Note that some mail servers wildcard local parts, typically the characters following a plus and less often the characters following a minus, so fred+bah@domain and fred+foo@domain might end up in the same inbox as fred+@domain or even as fred@domain. This can be useful for tagging emails for sorting, see below, and for spam control. Braces { and } are also used in that fashion, although less often.

space and "(),:;<>@[\] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash);
comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com.

In addition to the above ASCII characters, international characters above U+007F, encoded as UTF-8, are permitted by RFC 6531, though even mail systems that support SMTPUTF8 and 8BITMIME may restrict which characters to use when assigning local-parts.
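The unquoted-form rules above can be sketched as a small check. This is only an illustration under stated assumptions: the name `isUnquotedLocalPart` is hypothetical, quoted strings, comments, and non-ASCII characters are ignored entirely, and the 64-character cap is the limit mentioned in the invalid examples further down.

```typescript
// Hypothetical helper: checks only the unquoted form of a local-part.
// Quoted strings, comments, and non-ASCII characters are deliberately out of scope.
const ATEXT_RUN = /^[A-Za-z0-9!#$%&'*+\-\/=?^_`{|}~]+$/;

function isUnquotedLocalPart(local: string): boolean {
  // Length cap of 64 characters on the local-part.
  if (local.length === 0 || local.length > 64) {
    return false;
  }
  // Splitting on "." turns a leading, trailing, or doubled dot into an
  // empty segment, which the character-class test rejects.
  return local.split(".").every((segment) => ATEXT_RUN.test(segment));
}

console.log(isUnquotedLocalPart("john.smith")); // true
console.log(isUnquotedLocalPart("John..Doe"));  // false
```

Splitting on the dot keeps the dot rules out of the regex itself, which makes each rule easier to verify separately.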

Domain

The domain name part of an email address has to conform to strict guidelines: it must match the requirements for a hostname, a list of dot-separated DNS labels, each label being limited to a length of 63 characters and consisting of:

uppercase and lowercase Latin letters A to Z and a to z;
digits 0 to 9, provided that top-level domain names are not all-numeric;
hyphen -, provided that it is not the first or last character.

This rule is known as the LDH rule (letters, digits, hyphen). In addition, the domain may be an IP address literal, surrounded by square brackets [], such as jsmith@[192.168.2.1] or jsmith@[IPv6:2001:db8::1], although this is rarely seen except in email spam. Internationalized domain names (which are encoded to comply with the requirements for a hostname) allow for presentation of non-ASCII domains. In mail systems compliant with RFC 6531 and RFC 6532 an email address may be encoded as UTF-8, both a local-part as well as a domain name.

Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and john.smith@example.com(comment) are equivalent to john.smith@example.com.
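As a rough illustration of the LDH rule, here is a minimal sketch of a hostname check. The name `isLdhHostname` is made up for this example, and IP-address literals, comments, and internationalized names are not handled.

```typescript
// Hypothetical helper: checks a domain against the LDH rule described above.
// IP-address literals, comments, and internationalized names are out of scope.
const LDH_LABEL = /^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$/;

function isLdhHostname(domain: string): boolean {
  const labels = domain.split(".");
  // Every label: 1 to 63 characters, letters/digits/hyphen,
  // with no hyphen in first or last position.
  const labelsOk = labels.every(
    (label) => label.length <= 63 && LDH_LABEL.test(label)
  );
  // The top-level label must not be all-numeric.
  const topLevel = labels[labels.length - 1] || "";
  return labelsOk && !/^[0-9]+$/.test(topLevel);
}

console.log(isLdhHostname("strange-example.com")); // true
console.log(isLdhHostname("-example.com"));        // false
```

A single-label name such as mailserver1 passes this check, which matches the "local domain name with no TLD" case in the examples.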

Examples

Valid email addresses

simple@example.com
very.common@example.com
disposable.style.email.with+symbol@example.com
other.email-with-hyphen@example.com
fully-qualified-domain@example.com
user.name+tag+sorting@example.com (may go to user.name@example.com inbox depending on mail server)
x@example.com (one-letter local-part)
"very.(),:;<>[]\".VERY.\"very@\ \"very\".unusual"@strange.example.com
example-indeed@strange-example.com
admin@mailserver1 (local domain name with no TLD, although ICANN highly discourages dotless email addresses: https://www.icann.org/news/announcement-2013-08-30-en)
#!$%&'*+-/=?^_`{}|~@example.org
"()<>[]:,;@\\\"!#$%&'-/=?^_`{}| ~.a"@example.org
example@s.example (see the List of Internet top-level domains)
user@[2001:DB8::1]
" "@example.org (space between the quotes)

Invalid email addresses

Abc.example.com (no @ character)
A@b@c@example.com (only one @ is allowed outside quotation marks)
a"b(c)d,e:f;g<h>i[j\k]l@example.com (none of the special characters in this local-part are allowed outside quotation marks)
just"not"right@example.com (quoted strings must be dot separated or the only element making up the local-part)
this is"not\allowed@example.com (spaces, quotes, and backslashes may only exist when within quoted strings and preceded by a backslash)
this\ still\"not\allowed@example.com (even if escaped (preceded by a backslash), spaces, quotes, and backslashes must still be contained by quotes)
1234567890123456789012345678901234567890123456789012345678901234+x@example.com (local part is longer than 64 characters)
john..doe@example.com (double dot before @)
john.doe@example..com (double dot after @)

coreyshuman commented 6 years ago

This is a promising solution if we could figure out a way to standardize it for our projects. https://github.com/django/django/blob/master/django/core/validators.py#L164

Plus lots of examples and resources here: http://emailregex.com/

zbyte64 commented 6 years ago

This Stack Overflow answer has a pretty awesome regexp pattern with a state machine diagram: https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression

But considering that different languages have different regexp syntaxes, it might be better to designate a validation library for each language we use. For Node.js, isemail looks pretty robust: https://github.com/hapijs/isemail/blob/master/test/tests.json

coreyshuman commented 6 years ago

I would like to humbly propose a solution which performs as well as the RFC5322 Official Standard (in my particular test set) but is much easier to understand and verify.

^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$

^ - start of line
(?!\.) - don't allow the line to start with .
(?!.*?\.(\.|@)) - don't allow consecutive periods, ex. (john..conor@test.com). Also don't allow a period at the end of the local part, ex (corey.@test.com)
[\w\d.!#$%&'*+\-\/=?^_`{|}~]+ - match one or more letters, numbers, and these special characters: .!#$%&'*+-/=?^_`{|}~
@ - match the literal character @
[\w\d.-]+ - match one or more letter, digit, period (.), or hyphen (-)
\. - match a period (.)
[\w\d]{2,} - match 2 or more letters and numbers
$ - end of line
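To make the breakdown above easier to check piece by piece, here is a sketch that assembles the same pattern from named fragments. The constant names are invented for illustration; the fragments themselves are copied from the list above.

```typescript
// Fragment names are hypothetical; the fragments come from the breakdown above.
const NO_LEADING_DOT = "(?!\\.)";
const NO_DOT_BEFORE_DOT_OR_AT = "(?!.*?\\.(\\.|@))";
const LOCAL_PART = "[\\w\\d.!#$%&'*+\\-\\/=?^_`{|}~]+";
const DOMAIN_BODY = "[\\w\\d.-]+";
const TLD = "\\.[\\w\\d]{2,}";

const EMAIL_PATTERN = new RegExp(
  "^" + NO_LEADING_DOT + NO_DOT_BEFORE_DOT_OR_AT +
    LOCAL_PART + "@" + DOMAIN_BODY + TLD + "$"
);

console.log(EMAIL_PATTERN.test("corey+1@shift3tech.com")); // true
console.log(EMAIL_PATTERN.test("john..conor@test.com"));   // false
console.log(EMAIL_PATTERN.test("corey.@test.com"));        // false
```

One detail worth knowing when reading the pattern: `\w` already matches digits and the underscore, so `[\w\d]` is equivalent to `\w`, and the final `[\w\d]{2,}` also accepts underscores (for example, `a@b.__` matches).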

This regex can be tested here: https://regex101.com/r/A9jZZ4/4

This is not meant to be a perfect solution, but it should cover 99% of the email addresses Shift3 would expect to deal with, while catching some basic mistakes for user convenience. It does NOT handle extended ASCII / international characters, which RFC 6531 permits.

The following email addresses pass this validation, as expected:

test@gmail.com
test@me.com
nan.b.dog@somewhere.haha
corey+1@shift3tech.com
very.common@example.com
disposable.style.email.with+symbol@example.com
other.email-with-hyphen@example.com
fully-qualified-domain@example.com
user.name+tag+sorting@example.com
x@example.com
1234567890123456789012345678901234567890123456789012345678901234+x@example.com

The following email addresses fail this validation, as expected:

corey.@test.com
@test.com
admin@mailserver1
"()<>[]:,;@\\\"!#$%&'-/=?^_`{}| ~.a"@example.org
user@[2001:DB8::1]
" "@example.org
.t@a.com
"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com
Abc.example.com 
A@b@c@example.com
a"b(c)d,e:f;g<h>i[j\k]l@example.com
just"not"right@example.com
this is"not\allowed@example.com
this\ still\"not\\allowed@example.com
john..doe@example.com
john.doe@example..com
corey@test.com.
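The two lists above lend themselves to a table-driven check. The sketch below runs the proposed regex over a subset of them; the quoted-string examples are left out only to keep the string escaping readable, and the variable names are illustrative.

```typescript
// The proposed pattern, copied from the comment above.
const PROPOSED = /^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$/;

const shouldPass: string[] = [
  "test@gmail.com",
  "test@me.com",
  "nan.b.dog@somewhere.haha",
  "corey+1@shift3tech.com",
  "very.common@example.com",
  "disposable.style.email.with+symbol@example.com",
  "other.email-with-hyphen@example.com",
  "fully-qualified-domain@example.com",
  "user.name+tag+sorting@example.com",
  "x@example.com",
];

const shouldFail: string[] = [
  "corey.@test.com",
  "@test.com",
  "admin@mailserver1",
  "user@[2001:DB8::1]",
  ".t@a.com",
  "Abc.example.com ",
  "A@b@c@example.com",
  'just"not"right@example.com',
  "john..doe@example.com",
  "john.doe@example..com",
  "corey@test.com.",
];

// Collect every address whose result disagrees with the expectation.
const wronglyRejected = shouldPass.filter((email) => !PROPOSED.test(email));
const wronglyAccepted = shouldFail.filter((email) => PROPOSED.test(email));

console.log(wronglyRejected.length, wronglyAccepted.length); // 0 0
```

Reporting the disagreeing addresses, rather than stopping at the first failure, makes it easier to see patterns when new test cases are thrown at the regex.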

I would appreciate it if others would throw some other test cases against this regex and try to break it.

For reference, here is the RFC 5322 Standard I am comparing against.

^(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])$

Found at http://emailregex.com/

ggoforth commented 6 years ago

[Image: screen shot 2018-07-27 at 3 13 12 pm]

But for reals, reading this issue is fantastic. I like the solution proposed at the end, and the amount of testing done against it. 👍

zbyte64 commented 6 years ago

Running through the validation examples from isemail against

^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$

Most notable are the lack of UTF-8 support and the hyphen handling.

False positives:

test@iana.123
a@abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijkl.hijk
test@-iana.org
test@iana-.com
test@.iana.org

False negatives:

êjness@iana.org
ñoñó1234@something.com
test@\uD800\uD800ñoñó郵件ñoñó郵件.ñoñó郵件ñoñoñó郵件ñoñó.郵件ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.oñó郵件ñoñó郵件ñoñó郵件.商務
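For discussion, here is a sketch of how the pattern might be adjusted if we wanted to address both findings. It is a hypothetical variant, not something proposed in this thread: it swaps `\w` for Unicode letter and number classes, and validates the domain label by label. The overall length limit behind the second false positive is not handled, and the very long internationalized example is not tested.

```typescript
// Hypothetical variant, not something proposed in the thread.
// Built with the RegExp constructor so the "u" flag and \p{...} escapes
// are interpreted at runtime.
const UNI_LABEL = "[\\p{L}\\p{N}](?:[\\p{L}\\p{N}-]{0,61}[\\p{L}\\p{N}])?";
const UNI_LOCAL = "[\\p{L}\\p{N}_.!#$%&'*+\\-/=?^`{|}~]+";

const UNICODE_VARIANT = new RegExp(
  "^(?!\\.)(?!.*?\\.(\\.|@))" + UNI_LOCAL +
    // One or more dot-terminated labels, then a final label that is not all-numeric.
    "@(?:" + UNI_LABEL + "\\.)+(?!\\p{N}+$)" + UNI_LABEL + "$",
  "u"
);

// "\u00EA" is the precomposed form of the accented e in the first false negative.
console.log(UNICODE_VARIANT.test("\u00EAjness@iana.org")); // true
console.log(UNICODE_VARIANT.test("test@-iana.org"));       // false
console.log(UNICODE_VARIANT.test("test@iana.123"));        // false
```

The trade-off is visible in the length of the pattern: each fix makes it harder to read, which is the simplicity-versus-accuracy question this thread keeps coming back to.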

coreyshuman commented 5 years ago

I'm not as concerned with hyphen support, given the balance we're trying to strike between simplicity and complete accuracy to RFC 5322. A false positive is not a big deal, whereas a false negative would stop a valid user from accessing a service. With that in mind, the false negatives do seem like a problem. How common is UTF-8 support among the major email providers? And what percentage of users would hit that use case? If we're talking < 1%, I would rather just tell a user to use a different email address.

Let me know what you guys think.

stephengtuggy commented 5 years ago

Personally, I've known people from multiple people groups in various parts of the world, and as far as I recall, almost all of them used plain ASCII characters in their email addresses, web addresses, and IM'ing. So I don't think UTF-8 support is a big deal.

zbyte64 commented 5 years ago

Frankly, I think it is more important to adopt a library for this concern than to bless a regex to be copied into all projects. Having a small, clever regex pattern to stamp out is cool, but it runs afoul of the DRY principle: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself. The argument for simplicity makes more sense if we're the ones maintaining the code, which, for something as common as email validation, can we not?

Emoji is another reason to support UTF8: https://medium.com/@zackbloom/i-have-a-unicode-email-address-fbecd630ec12

If we're good at our jobs, our software should live to see a day when UTF-8 is more common in email addresses. Since we're here to address email validation, let's do it so we don't have to again.

coreyshuman commented 5 years ago

I don't disagree. My goal in this particular task was to discover a good front-end validation for email which gives a user immediate feedback to avoid typos, not necessarily to vet and validate all possible correct email addresses (we can leave that to the 3rd party email service).

The issue I see with using someone else's library for this is that we support and develop for many frontend frameworks (Ionic, React, .NET MVC, NativeScript, Xamarin, ...). One library would not work across all of those. A regex line would.

I imagine this being the beginning of a Shift3 internal library of common functions, which we could build out for all of our primary development. If these things were rolled into our own libraries, we'd be respecting DRY way more than we do nowadays (across projects, not necessarily per individual project).

@zbyte64 I'm definitely open for other suggestions as well. Let me know if there was a particular library you had in mind, or if there is something you're already doing on your projects that you really like.

coreyshuman commented 3 years ago

@michaelachrisco 3 years later and this is still a recurring issue in projects. Now that we have boilerplates to implement a standard, I think this is a good time to resurface this.

Now that we're supporting locale translation in the boilerplates, I think the UTF-8 argument has some more strength behind it.

I suspect for client-side validation we will still be served best by simple and permissive validation, as opposed to strict and technical. What do you think?

coreyshuman commented 3 years ago

Adding that I agree with Justin Schiff's assessment here:

@coreyshuman I would normally agree, but what I'm trying to make clear here is that a complicated email regex is not the preferred pattern for signup or email validation anyway. Attempting to send an email to the address specified is. Providing a permissive regex, or none at all (or just asking the user to enter their email twice), while sending a confirmation email is a 100% method to ensure you end up with a valid email address, and a 100% method to make sure you have no false negatives.

When you run into an "edge case" in your complicated regular expression you have to do the following -> find the fix, hope you don't introduce a regression in other untested parts of the regex -> backport to all running applications using the old regex -> make sure all old versions of applications are updated -> etc. etc. etc.

I think that having an email regex may be valuable for things other than sign up fields, but I want it to be clear that in my opinion for sign in/sign up this is not the preferred pattern of validation, nor does it enhance security.

Originally posted by @DropsOfSerenity in https://github.com/Shift3/standards-and-practices/issues/130#issuecomment-541257822
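The flow described in that quote can be sketched in a few lines. Everything here is hypothetical: the function names are invented, and the mail-sending step is passed in as a callback because the thread does not name a specific mail service.

```typescript
// Hypothetical names throughout; this only illustrates the shape of the flow.
function looksLikeEmail(input: string): boolean {
  const at = input.lastIndexOf("@");
  // Only require something before and something after the last "@".
  return at > 0 && at < input.length - 1;
}

function requestSignup(
  email: string,
  sendConfirmation: (to: string) => void
): boolean {
  if (!looksLikeEmail(email)) {
    return false; // obvious typo: give the user immediate feedback
  }
  sendConfirmation(email); // the confirmation email is the real validation
  return true;
}

const outbox: string[] = [];
console.log(requestSignup("user@example.com", (to) => { outbox.push(to); })); // true
console.log(requestSignup("Abc.example.com", (to) => { outbox.push(to); }));  // false
```

The check is intentionally loose enough to let through addresses that a stricter regex would reject; whether an address is real is decided by the confirmation email, not by the pattern.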

coreyshuman commented 3 years ago

I noticed we do have an example documented in best practices here: https://github.com/Shift3/standards-and-practices/tree/main/best-practices/development-tools/validation#code-example

For this to be a completed standard, we should include a definition of our goal: what should and shouldn't pass this validation. It should also include a set of unit tests to verify that goal.

Karvel commented 3 years ago

The current RegEx in the Angular boilerplate is the following:

/^[a-z0-9!#$%&'*+\/=?^_\`{|}~.-]+@[a-z0-9]([a-z0-9-])+(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i

For the test sets you provided above, all of the addresses that should match do. Of those that should fail, the ones commented out below pass instead.

        const failingValues: string[] = [
          // 'corey.@test.com', //
          '@test.com',
          // 'admin@mailserver1', //
          `"()<>[]:,;@\\\"!#$%&'-/=?^_\`{}| ~.a"@example.org`,
          'user@[2001:DB8::1]',
          '" "@example.org',
          '.t@a.com',
          '"very.(),:;<>[]".VERY."very@\\ "very".unusual"@strange.example.com',
          'Abc.example.com ',
          'A@b@c@example.com',
          'a"b(c)d,e:f;g<h>i[jk]l@example.com',
          'just"not"right@example.com',
          'this is"notallowed@example.com',
          'this still"not\\allowed@example.com',
          // 'john..doe@example.com', //
          'john.doe@example..com',
          'corey@test.com.',
        ];
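A quick way to see the commented-out cases in isolation is to run the pattern directly, outside Angular. This is a sketch with illustrative variable names; the pattern is copied from above except that the backslash before the backtick is dropped, which does not change what the character class matches.

```typescript
// Boilerplate pattern from above (backtick left unescaped; same character class).
const BOILERPLATE_EMAIL = /^[a-z0-9!#$%&'*+\/=?^_`{|}~.-]+@[a-z0-9]([a-z0-9-])+(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i;

// The three values that are commented out in the list above.
const acceptedAnyway = ["corey.@test.com", "admin@mailserver1", "john..doe@example.com"];
console.log(acceptedAnyway.map((email) => BOILERPLATE_EMAIL.test(email))); // [ true, true, true ]

// From reading the pattern, the first domain label needs at least two
// characters, so '.t@a.com' appears to be rejected because of the
// one-letter label "a" rather than because of the leading dot.
console.log(BOILERPLATE_EMAIL.test(".t@a.com"));  // false
console.log(BOILERPLATE_EMAIL.test(".t@ab.com")); // true
console.log(BOILERPLATE_EMAIL.test("user@x.com")); // false
```

If that reading is right, an otherwise ordinary address on a one-letter domain label would be rejected, which may be worth covering with an extra test value.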

I do have unit tests for the validator using the regular expression, but I can add the test sets as follows:

      describe('[Unit] EmailValidation validEmail() Required', () => {
        // Validator under test (the required variant, per the suite name).
        const emailValidator = EmailValidation.validEmail(true);
        const emailControl = new FormControl('');
        // Addresses the validator is expected to accept.
        const matchingValues: string[] = [
          'test@gmail.com',
          'test@me.com',
          'nan.b.dog@somewhere.haha',
          'corey+1@shift3tech.com',
          'very.common@example.com',
          'disposable.style.email.with+symbol@example.com',
          'other.email-with-hyphen@example.com',
          'fully-qualified-domain@example.com',
          'user.name+tag+sorting@example.com',
          'x@example.com',
          '1234567890123456789012345678901234567890123456789012345678901234+x@example.com',
        ];

        // Addresses the validator is expected to reject.
        const failingValues: string[] = [
          '@test.com',
          `"()<>[]:,;@\\\"!#$%&'-/=?^_\`{}| ~.a"@example.org`,
          'user@[2001:DB8::1]',
          '" "@example.org',
          '.t@a.com',
          '"very.(),:;<>[]".VERY."very@\\ "very".unusual"@strange.example.com',
          'Abc.example.com ',
          'A@b@c@example.com',
          'a"b(c)d,e:f;g<h>i[jk]l@example.com',
          'just"not"right@example.com',
          'this is"notallowed@example.com',
          'this still"not\\allowed@example.com',
          'john.doe@example..com',
          'corey@test.com.',
        ];

        it(`should return null if value matches a list of values that should work`, () => {
          matchingValues.forEach((value) => {
            emailControl.setValue(value);
            expect(emailValidator(emailControl)).toEqual(null);
          });
        });

        it(`should return { invalidEmail: 'Please enter a valid email.' } if value matches a list of values that should fail`, () => {
          failingValues.forEach((value) => {
            emailControl.setValue(value);
            const expectedValue = {
              invalidEmail: 'Please enter a valid email.',
            };
            expect(emailValidator(emailControl)).toEqual(expectedValue);
          });
        });
      });

We can decide whether to keep the current RegEx or change it, and whether to add the above test values.

Either way, the boilerplate also follows the recommendations that @DropsOfSerenity posted above: it requires confirming the email address and sends an activation email to that account.

michaelachrisco commented 3 years ago

@michaelachrisco 3 years later and this is still a recurring issue in projects. Now that we have boilerplates to implement a standard, I think this is a good time to resurface this.

Now that we're supporting locale translation in the boilerplates, I think the UTF-8 argument has some more strength behind it.

I suspect for client-side validation we will still be served best by simple and permissive validation, as opposed to strict and technical. What do you think?

@coreyshuman I agree with making validation simple and permissive as you stated. If we get too strict with the regex/standard, we may get quite a few false negatives (I remember a few horror projects I worked on in the EDI world). Emojis are now valid in email addresses. It's a strange world we live in.

I also like the example @Karvel shows by adding real email addresses to the unit tests for each of the valid/invalid emails. As time goes on, this list will naturally expand as we find a user with some strange valid email address that we will need to accommodate and we can just add that to the unit test/fix.

Most of the projects I have worked on in the past have stolen or used the default MDN example here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/email and called it a day.

/^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

This, of course, leaves in bugs (like https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489) but it does seem to be "good enough" for most.
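To show what "good enough" means in practice, here is a sketch that runs the MDN pattern against a few addresses discussed earlier in this thread. The variable names are illustrative, and the expectations in the comments come from reading the pattern, not from MDN.

```typescript
// MDN pattern, written on a single line.
const MDN_EMAIL = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Permissive on the local-part and on dotless domains:
const accepted = ["admin@mailserver1", "john..doe@example.com", "corey.@test.com", ".t@a.com"];
// Stricter on domain labels (hyphen and dot placement):
const rejected = ["test@-iana.org", "test@iana-.com", "test@.iana.org", "john.doe@example..com"];

console.log(accepted.every((email) => MDN_EMAIL.test(email)));  // true
console.log(rejected.every((email) => !MDN_EMAIL.test(email))); // true
```

So this pattern handles the hyphen cases that were raised as false positives earlier, while staying permissive on the local-part. It is still ASCII-only.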

I feel like we could add unit tests to the examples https://github.com/Shift3/standards-and-practices/tree/main/best-practices/development-tools/validation#code-example but a better place would probably be in the boilerplate projects.

stephengtuggy commented 3 years ago

FWIW, I also agree with making validation simple and permissive, and with requiring confirmation emails. I think something like @Karvel's regex or the MDN one @michaelachrisco mentioned would probably work well.
