JoshData / python-email-validator

A robust email syntax and deliverability validation library for Python.
The Unlicense

Validating e-mails with display-name: John Doe <john@example.com> #116

Closed: ThorstenEngel closed this issue 1 year ago

ThorstenEngel commented 1 year ago

Hi,

In my use case I need to validate the syntax of e-mail addresses that include a display name. I think https://www.rfc-editor.org/rfc/rfc5322#section-3.4 fully allows addresses like "John Doe <john@example.com>" or, in my case, something like "ACME Corp. <no-reply@acme.com>".

I did not find a way to verify these addresses with yours or any other library. It would be great if your library could validate this too ;-).

Warm regards, thorsten

JoshData commented 1 year ago

https://github.com/mailgun/flanker can do that (and you could combine it with this library). (We link to flanker at the top of our README.)

I think parsing display names could be a useful addition.

ThorstenEngel commented 1 year ago

Thanks, this helped!

salty-horse commented 9 months ago

Flanker's lack of maintenance (and its dependency on unmaintained packages) is beginning to cause breakage on modern versions of Python (3.13 specifically).

For extracting the email from the display name you can use Python's built-in email.utils.parseaddr.
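A minimal sketch of that suggestion, using only the standard library. The helper name split_display_name is hypothetical; the idea is that the bare address it returns would then be handed to a strict validator such as email_validator.validate_email, which is left as a comment here so the snippet has no third-party dependency.

```python
from email.utils import parseaddr


def split_display_name(value):
    """Split 'Name <addr>' into (display_name, address) with parseaddr.

    Hypothetical helper for illustration. parseaddr does not validate
    anything; it only separates the two parts.
    """
    name, addr = parseaddr(value)
    return name, addr


name, addr = split_display_name("John Doe <john@example.com>")
print(name)
print(addr)

# The bare address could then be checked separately, for example:
#   email_validator.validate_email(addr, check_deliverability=False)
```

parseaddr returns ('', '') or an empty address part when it cannot make sense of the input, so the caller should check that the address is non-empty before validating it.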

JoshData commented 9 months ago

Great suggestion. 😀

JoshData commented 9 months ago

I was thinking of replacing flanker with parseaddr in the recommendation in the README, but I see that parseaddr is a little flaky with edge cases. Just from a minute of playing I see it drops parts of the input it doesn't like:

>>> email.utils.parseaddr("Test <@x>")
('Test', '')

>>> email.utils.parseaddr("Test <a@xx>, X <b@b>")
('Test', 'a@xx')

So it's not something I would necessarily recommend to use with a strict validation tool like this library.
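One possible way to guard against those silent drops, sketched under assumptions: parse with email.utils.getaddresses instead, and reject the input unless it yields exactly one entry with a non-empty address. The function name parse_single_address is hypothetical, and this is not how this library or flanker behaves. The exact tuples returned for malformed input differ between Python versions, so the sketch only relies on the "exactly one usable address" condition.

```python
from email.utils import getaddresses


def parse_single_address(value: str) -> tuple[str, str]:
    """Return (display_name, address), or raise ValueError.

    Hypothetical helper: rejects input that parses to zero or several
    addresses, or to an entry whose address part is empty.
    """
    parsed = getaddresses([value])
    if len(parsed) != 1:
        raise ValueError(f"expected exactly one address in {value!r}")
    name, addr = parsed[0]
    if not addr or "@" not in addr:
        raise ValueError(f"no usable address found in {value!r}")
    return name, addr


for candidate in ["John Doe <john@example.com>", "Test <@x>", "Test <a@xx>, X <b@b>"]:
    try:
        print(candidate, "->", parse_single_address(candidate))
    except ValueError as exc:
        print(candidate, "-> rejected:", exc)
```

This only catches the two cases shown above; the address that survives would still need to go through a strict syntax check.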

salty-horse commented 9 months ago

Flanker doesn't accept Test <@x>, and from Test <a@xx>, X <b@b> it extracts b@b.

For my specific use, I don't care much about those cases, so I think it's a good enough solution :)

JoshData commented 9 months ago

Fair point!

jplusc commented 4 months ago

Hi, just wanted to let you know that I moved from v2.1.2 to the current git branch to test out the allow_display_name option and ran into a difference between it and the previously mentioned workarounds.

I've been using email.utils.parseaddr and then just sending the email portion to email_validator, but email.utils.parseaddr works with email addresses like sigma@pair.com (Kevin Martin), whereas email_validator raises the exception EmailSyntaxError: The part after the @-sign contains invalid characters: '(', ')', SPACE.

I know you may not want to handle emails in this format, but thought the difference should be documented somewhere.

thanks for all you do!

>>> import email.utils
>>> import email_validator
>>> from flanker.addresslib import address
>>>
>>> #parseaddr
>>> s = "sigma@pair.com (Kevin Martin)"
>>> email.utils.parseaddr(s)
('Kevin Martin', 'sigma@pair.com')
>>>
>>> #flanker
>>> address.parse(s).address
'sigma@pair.com'
>>>
>>> #email_validator
>>> email_validator.validate_email(s, allow_display_name = True, check_deliverability = False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python311\Lib\site-packages\email_validator\validate_email.py", line 124, in validate_email
    domain_name_info = validate_email_domain_name(domain_part, test_environment=test_environment, globally_deliverable=globally_deliverable)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\email_validator\syntax.py", line 441, in validate_email_domain_name
    raise EmailSyntaxError("The part after the @-sign contains invalid characters: " + ", ".join(sorted(bad_chars)) + ".")
email_validator.exceptions_types.EmailSyntaxError: The part after the @-sign contains invalid characters: '(', ')', SPACE.

JoshData commented 4 months ago

>>> s = "sigma@pair.com (Kevin Martin)"

Huh. What I implemented follows RFC 2822's name <email> format:

name-addr       =       [display-name] angle-addr
angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
display-name    =       phrase

I'm not sure what the source of an email (name) format is. Is it commonly used?

jplusc commented 4 months ago

I'm not sure what the source of an email (name) format is. Is it commonly used?

It might just be a qmail, older Postfix, or FreeBSD thing.

I don't see it often, but when I do, and if it came from a message, it normally also has a Received header from either qmail or Postfix. Here is one I just happened to have handy: Received: by six.pairlist.net (Postfix, from userid 0) id CC6D26ED5C

And I used to rent a FreeBSD server, and whenever I would use their mailing list functions, my outgoing emails would look like that as well. (But I don't know if they also used Postfix or qmail to send them.)

It's not super common, but common enough that I would have to work around the exceptions, so I am probably going to just stick with email.utils.parseaddr. (I know it also has weird edge-case behavior, but its mishandling of edge cases hasn't affected my dataset in a meaningful way yet.)

Hmm, just stumbled across this (from 2014): https://wordtothewise.com/2014/12/friendly-email-addresses/ "parentheses isn't really a display name at all, rather it's a human readable comment."

I thought I saw some other mention around here about ignoring comments in ()'s. I've never seen anything other than a name or mailing list name in the parens.

ThorstenEngel commented 4 months ago

We recently had an e-mail with the display name "TIERE (gemeinnütziger Verein) Max Müller". As it contains umlauts and brackets, it did not work with email.utils.formataddr (it did not add the necessary parentheses). So I successfully rewrote my code to replace formataddr((friendlyname, r_mail)) with

from email.headerregistry import Address

fullmail = str(Address(display_name=friendlyname, addr_spec=r_mail))

It looked to me as if email.headerregistry is better maintained than email.utils. getaddresses worked in my cases.
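For reference, a self-contained version of the snippet above. The display name is the one from this comment; the address max.mueller@example.com is a made-up placeholder, since the real one was not given. As far as I can tell, str() on an Address wraps the display name in double quotes when it contains special characters such as parentheses.

```python
from email.headerregistry import Address

# Display name taken from the comment above; the address is a
# hypothetical placeholder.
friendlyname = "TIERE (gemeinnütziger Verein) Max Müller"
r_mail = "max.mueller@example.com"

# Address quotes the display name because it contains "(" and ")".
fullmail = str(Address(display_name=friendlyname, addr_spec=r_mail))
print(fullmail)
```

Note that the resulting string still contains non-ASCII characters; whether it needs further encoding depends on where it is used (for example, assigning it to a header on an EmailMessage lets the email package handle that).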

JoshData commented 4 months ago

"parentheses isn't really a display name at all, rather it's a human readable comment."

Ahha! That makes sense. Comments came up in #77. As fun as it has been to implement display names, I probably am not going to get motivated to support comments.

It looked to me as if email.headerregistry is better maintained than email.utils.

Good to know!